Preface



Uncertainty is an increasingly important research topic in many areas of com- 
puter science. Many formalisms are being developed, with much interest at the 
theory level directed at developing a better understanding of the formalisms and 
identifying relationships between formalisms, and at the technology level directed 
at developing software tools for formalisms and applications of formalisms. 

The main European forum for the subject is the European Conference on 
Symbolic and Quantitative Approaches to Reasoning and Uncertainty (EC- 
SQARU). Following the success of the previous ECSQARU conferences, held in 
Marseilles (1991), Granada (1993), Fribourg (1995), and Bonn (1997), the fifth 
conference in the series was held at University College London in July 1999. 

This volume contains papers accepted for presentation at ECSQARU’99. In 
addition to the main conference, two workshops were held. The first was on 
Decision Theoretic and Game Theoretic Agents, chaired by Simon Parsons and 
Mike Wooldridge, and the second was on Logical and Uncertainty Models for 
Information Systems, chaired by Fabio Crestani and Mounia Lalmas. Selected
papers from the workshops are also included in these proceedings. 

We are indebted to the programme committee for their effort in organising
the programme, to the invited speakers, and to the presenters of the tutorials. 
Furthermore, we gratefully acknowledge the contribution of the many referees 
who were involved in the reviewing process. Finally we would like to thank the 
Department of Computer Science at University College London for administra- 
tive support. 

Programme Committee 

The programme committee was chaired by Anthony Hunter (University College 
London), and comprised Dov Gabbay (King’s College London), Finn Jensen 
(Aalborg University), Rudolf Kruse (University of Magdeburg), Simon Parsons 
(Queen Mary, University of London), Henri Prade (IRIT, Toulouse), Torsten
Schaub (University of Potsdam), and Philippe Smets (ULB, Bruxelles). 

Reviewers 

The programme committee is very grateful for all the hard work contributed 
by the reviewers. Hopefully, we have not missed anyone from the following 
list: Bruce D’Ambrosio, Florence Bannay, Salem Benferhat, Philippe Besnard, 
Hugues Bersini, Christian Borgelt, Rachel Bourne, Stefan Brass, Laurence Cholvy, 
Roger Cooke, Adnan Darwiche, Yannis Dimopoulos, Jurgen Dix, Didier Dubois, 
Uwe Egly, Linda van der Gaag, Joerg Gebhardt, Siegfried Gottwald, Rolf Haenni, 
Jean-Yves Jaffray, Radim Jirousek, Ruth Kempson, Uffe Kjaerulf, Frank Klawonn,
Aljoscha Klose, Juerg Kohlas, Paul Krause, Gerhard Lakemeyer, Mounia
Lalmas, Jerome Lang, Kim G. Larsen, Norbert Lehmann, T. Y. Lin, Thomas
Linke, Khalid Mellouli, Jerome Mengin, J.-J. Ch. Meyer, Sanjay Modgil, Yves 
Moinard, Serafin Moral, Detlef Nauck, Ann Nicholson, Pascal Nicolas, Dennis 
On the Dynamics of Default Reasoning



Grigoris Antoniou 



Griffith University, QLD 4111, Australia 
University of Macedonia, Thessaloniki, Greece 
ga@cit.gu.edu.au



Abstract. Default logic is a prominent rigorous method for reasoning 
with incomplete information based on assumptions. It is a static reason- 
ing approach, in the sense that it doesn’t reason about changes and their 
consequences. On the other hand, its nonmonotonic behaviour appears 
when changes to a default theory are made. 

This paper studies the dynamic behaviour of default logic in the face of 
changes. We consider the operations of contraction and revision, present 
several solutions to these problems, and study their properties. 



1 Introduction 

Nonmonotonic reasoning comprises qualitative techniques for reasoning with
incomplete information. Default logic is a specific nonmonotonic reasoning
method which is based on the use of default rules representing plausible
assumptions.

The use of default rules in a knowledge base is beneficial for practical pur- 
poses. It is well-recognized that humans maintain multiple sets of beliefs which 
are often mutually contradictory. The choice of a belief set is then driven by the 
context. Default logic supports this phenomenon since its semantics is based on 
alternative extensions of a given default theory. 

The use of default rules has been proposed for the maintenance of software
in general, and requirements engineering in particular. When the
requirements of a software product are collected it is usual that potential
stakeholders have conflicting requirements for the system to be developed. As has been
shown, default logic representations may adequately model various facets of this 
problem; in particular they allow for the representation of conflicting require- 
ments within the same model, and they highlight potential conflicts and possible 
ways of resolving them. These are central problems in the area of requirements
engineering.

One central issue of requirements engineering that cannot be addressed by 
pure default logic is that of evolving requirements. For example, requirements 
may evolve because requirements engineers and users cannot possibly foresee all 
the ways in which a system may be used, or because the software context changes
over time. 

While default logic supports reasoning with incomplete (and contradictory) 
knowledge it does not provide means for reasoning with change. The purpose of 
this paper is to study ways in which default logic can be enhanced by dynamic
operators. In particular we shall investigate two operations: revision, which is 
concerned with the addition of new information in a way that does not lead to 
conflicting knowledge (in the sense described in later sections); and contraction, 
which achieves the deletion of some information, without necessarily adding its 
negation to the current knowledge. 

While belief revision has been studying change in classical logic knowledge
bases since the mid-1980s, the problem of revising default theories has
attracted attention only recently, and not much has been achieved in this direction.
A notable exception is work that reports on the implementation of a specific
revision and contraction strategy as part of the GIN project.

In our exposition we will present two different families of approaches. It is 
easy to see that if we use default logic and its standard variants, there is an 
asymmetry between revision and contraction: while revision can be achieved by 
manipulating the set of facts only (if we are interested in adding a formula to all 
extensions), this cannot be done in the case of contraction, because there exist no 
“negative facts”. Thus the contraction has to be implemented by manipulating
the defaults. 

If this manipulation of defaults is undesirable (for example because defaults 
are meant to express long term rules) then we propose the use of preconstrained 
default theories. The idea is similar to the use of constraints in the Theorist 
framework and involves the formulation of formulae which should not be
included in any extension. When constraints are used, revision and contraction 
can be achieved by manipulating only the facts and constraints, while the
defaults remain unaltered. Revision in the context of Theorist was recently studied.

Finally we wish to point out that this paper follows the basic idea of avoiding 
extensive computations with defaults, relying instead on a static analysis of 
default theories. This is done at the expense of somewhat ad hoc solutions, as 
well as the neglect of minimal change ideas. 

2 Basics of Default Logic 

A default $\delta$ has the form $\frac{\varphi : \psi_1, \ldots, \psi_n}{\chi}$ with closed formulae
$\varphi, \psi_1, \ldots, \psi_n, \chi$. $\varphi$ is the prerequisite $pre(\delta)$,
$\psi_1, \ldots, \psi_n$ the justifications $just(\delta)$, and $\chi$ the consequent
$cons(\delta)$ of $\delta$. A default theory $T$ is a pair $(W, D)$ consisting of a set of formulae
$W$ (the set of facts) and a countable set $D$ of defaults. In case $D$ is finite and
$W$ is finitely axiomatizable, we call the default theory $T = (W, D)$ finite. In this
paper we consider finite default theories, since they are the most relevant for 
practical purposes. 

A preconstrained default theory is a triple $T = (W, D, C)$ such that $W$ and
$C$ are sets of formulae, and $D$ a countable set of defaults. We assume that
$Th(W) \cap C = \emptyset$.

Let $\delta = \frac{\varphi : \psi_1, \ldots, \psi_n}{\chi}$ be a default, and $E$ a deductively closed set of formulae.
We say that $\delta$ is applicable to $E$ iff $\varphi \in E$, and $\neg\psi_1, \ldots, \neg\psi_n \notin E$.

Let $\Pi = (\delta_0, \delta_1, \delta_2, \ldots)$ be a finite or infinite sequence of defaults from $D$
without multiple occurrences (modelling an application order of defaults from
$D$). We denote by $\Pi[k]$ the initial segment of $\Pi$ of length $k$, provided the length
of $\Pi$ is at least $k$.

- $In(\Pi) = Th(W \cup \{cons(\delta) \mid \delta \text{ occurs in } \Pi\})$, where $Th$ denotes the deductive
closure.

- $Out(\Pi) = \{\neg\psi \mid \psi \in just(\delta), \delta \text{ occurs in } \Pi\}$ ($\cup\, C$ in case $T$ is
preconstrained).

$\Pi$ is called a process of $T$ iff $\delta_k$ is applicable to $In(\Pi[k])$, for every $k$ such that
$\delta_k$ occurs in $\Pi$. $\Pi$ is successful iff $In(\Pi) \cap Out(\Pi) = \emptyset$, otherwise it is failed.
$\Pi$ is closed iff every default that is applicable to $In(\Pi)$ already occurs in $\Pi$. It has
been shown that Reiter’s original definition of extensions is equivalent to the following
one: A set of formulae $E$ is an extension of a default theory $T$ iff there is a closed
and successful process $\Pi$ of $T$ such that $E = In(\Pi)$. Finally we define:

- $T \mid\!\sim_{sc} \varphi$ iff $\varphi$ is included in all extensions of $T$.

- $T \mid\!\sim_{cr} \varphi$ iff there is an extension $E$ of $T$ such that $\varphi \in E$.
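The process-based definition of extensions can be sketched in code for small propositional theories. The following is only an illustration under stated assumptions, not an implementation from the paper: formulae are represented as Python functions over truth assignments, entailment is decided by brute-force truth tables, and all names (`Default`, `extensions`, `sceptical`, `credulous`, and the example atoms) are hypothetical. The search grows a process one applicable default at a time, stops when the process is closed, and keeps it only if it is successful.

```python
from collections import namedtuple
from itertools import product

# A default "pre : justs / cons". Formulae are functions that map a truth
# assignment (dict: atom -> bool) to bool. This encoding is an assumption.
Default = namedtuple("Default", "pre justs cons")


def models(formulas, atoms):
    """All truth assignments over `atoms` that satisfy every formula."""
    result = []
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if all(f(v) for f in formulas):
            result.append(v)
    return result


def entails(formulas, goal, atoms):
    """Membership of `goal` in Th(formulas), decided by truth tables."""
    return all(goal(v) for v in models(formulas, atoms))


def neg(f):
    return lambda v: not f(v)


def applicable(d, base, atoms):
    """pre in Th(base) and no negated justification in Th(base)."""
    return entails(base, d.pre, atoms) and not any(
        entails(base, neg(j), atoms) for j in d.justs)


def extensions(W, D, atoms, C=()):
    """In-sets (as generating formula lists) of closed, successful processes."""
    found = {}

    def grow(used):
        base = list(W) + [D[i].cons for i in used]      # generates In(process)
        candidates = [i for i in range(len(D))
                      if i not in used and applicable(D[i], base, atoms)]
        if candidates:                                   # not yet closed
            for i in candidates:
                grow(used + [i])
            return
        # Out(process): negated justifications, plus constraints if any.
        out = [neg(j) for i in used for j in D[i].justs] + list(C)
        if not any(entails(base, o, atoms) for o in out):   # successful
            key = frozenset(tuple(m[a] for a in atoms)
                            for m in models(base, atoms))
            found[key] = base                            # dedupe by model set

    grow([])
    return list(found.values())


def sceptical(exts, goal, atoms):
    return all(entails(E, goal, atoms) for E in exts)


def credulous(exts, goal, atoms):
    return any(entails(E, goal, atoms) for E in exts)


# Hypothetical example: two conflicting normal defaults.
atoms = ("quaker", "republican", "pacifist")
quaker = lambda v: v["quaker"]
republican = lambda v: v["republican"]
pacifist = lambda v: v["pacifist"]
W = [quaker, republican]
D = [Default(quaker, (pacifist,), pacifist),
     Default(republican, (neg(pacifist),), neg(pacifist))]
exts = extensions(W, D, atoms)
```

In this example the two defaults block each other, so the sketch finds two extensions; `pacifist` follows credulously but not sceptically. The search is exponential and only meant for toy theories.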

3 Dynamic Operators for Standard Default Theories

Suppose we have a default theory $T = (W, D)$ and wish to add new information in
the form of a formula $\varphi$. First we have to lay down which interpretation of default
logic we will be using: extensions, credulous reasoning, or sceptical reasoning. In 
this paper we view a default theory as a top-level specification which supports 
several alternative views (the extensions). Therefore we wish to make changes 
at the level of a default theory, rather than revising single extensions. 

One advantage of this design decision is that our approach incorporates lazy
evaluation: computation is deferred until it becomes necessary. In our case we make
changes to the default theory without the necessity to compute extensions at
every step. Default logic supports this approach since a default theory represents
a set of alternative extensions. 

3.1 Revision - Adding a Formula to All Extensions 

This task can be achieved by adding $\varphi$ to the set of facts $W$. Of course if we do
it in the naive way of just adding it as a new fact, the set of facts may become
inconsistent. In that case the new default theory has only one extension, the set
of all formulae. This is undesirable, and we resolve the problem by revising the
set of facts $W$ with $\varphi$ in the sense of belief revision. This ensures that $\varphi$ is added
to $W$ in such a way that consistency is maintained and the changes made are
minimal.

Definition 1. Let $T = (W, D)$ be a default theory, and $\varphi$ a formula. Define
$T^{+}_{\varphi} = (W^{*}_{\varphi}, D)$, where $*$ is a theory base revision operator.
Note that in the definition above we are using a theory base revision operator
instead of an AGM revision operator. The reason is that AGM revision operators
apply to deductively closed (classical) theories. Theoretically we can assume
that $W$ is deductively closed; indeed, according to the definition of default logic in
section 2, there is no difference between the use of $W$ and $Th(W)$. But in practice
$W$ will be a finite axiomatization of the facts, thus we need to use operators which
work on finite bases rather than deductively closed theories. More information on
these issues, including the relationship between theory base revision and AGM
belief revision, can be found in the belief revision literature.

Theorem 2. (a) If $\varphi$ is not a contradiction, then $T^{+}_{\varphi} \mid\!\sim_{sc} \varphi$.

(b) If $E$ is an extension of $T$ such that $\varphi \in E$, then $E$ is an extension of $T^{+}_{\varphi}$.
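To make the idea of revising a finite base of facts concrete, here is a minimal sketch of one very simple base revision operator. It is an assumption made for illustration, not the operator used in the paper: it keeps the new formula and then re-adds the old facts in their listed order as long as the base stays consistent, so the outcome depends on that order. Formulae are again encoded as functions over truth assignments, and the names `consistent` and `revise_base` are hypothetical.

```python
from itertools import product


def consistent(formulas, atoms):
    """True iff some truth assignment over `atoms` satisfies all formulas."""
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if all(f(v) for f in formulas):
            return True
    return False


def revise_base(W, phi, atoms):
    """Keep phi, then re-add facts of W in order while consistency holds.

    The result is an inclusion-maximal subset of W that is consistent
    with phi (one of possibly several, determined by the order of W).
    """
    if not consistent([phi], atoms):
        raise ValueError("phi must not be a contradiction")
    kept = [phi]
    for fact in W:
        if consistent(kept + [fact], atoms):
            kept.append(fact)
    return kept


# Hypothetical example: facts {a, a -> b}, new information: not b.
atoms = ("a", "b")
a = lambda v: v["a"]
a_implies_b = lambda v: (not v["a"]) or v["b"]
not_b = lambda v: not v["b"]
revised = revise_base([a, a_implies_b], not_b, atoms)
```

With the facts listed as `[a, a_implies_b]`, the sketch keeps `a` and drops the implication; listing them the other way round would drop `a` instead. The defaults of the theory are left untouched, as in Definition 1.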



3.2 Revision - Adding a Formula to At Least One Extension 

In this case we don’t necessarily want to add $\varphi$ to every extension. On the other
hand, it is difficult to identify statically, that is, without actually computing
the extensions, a default $\delta$ contributing to an extension, and add $\varphi$ to its
consequent.

The solution we propose is to add a new default which leads to a new
extension containing $\varphi$. The other extensions should remain the same. The naive
approach of just adding the default $\frac{true : \varphi}{\varphi}$ is not sufficient. For example, the
existence of another default could destroy any extension containing $\varphi$.

Thus we must take care to separate the default to be added from all previous
defaults. In addition we need to remove $\neg\varphi$ from the set of facts $W$. In case
$\neg\varphi \in Th(W)$ we turn the formulae removed from $W$ into defaults.^1 Let us first
outline this construction before giving the formal definition:



1. Remove $\neg\varphi$ from $Th(W)$ using theory base contraction. This is necessary,
otherwise $\varphi$ will be excluded from all extensions.

2. Add the default $\delta = \frac{true : \varphi \wedge p}{\varphi \wedge p}$, where $p$ is a new atom. Note that $\delta$ is applicable
to the new set of facts, due to step 1 and the condition on $p$. According to
steps 3 and 4, all other defaults will be inapplicable after the application
of $\delta$. Thus $\delta$ is guaranteed to lead to an extension which contains $\varphi$.

3. Add $\neg p$ to the justification and consequent of every (old) default in $D$. This
ensures that $\delta$ does not interfere in any way with the other defaults.

4. Turn all formulae deleted from $W$ in step 1 into defaults, and take care that
they work in conjunction with the old defaults only (using $\neg p$ as in step 3).



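Definition 3. Let $T = (W, D)$ be a default theory and $\varphi$ a formula. Define an
operator $cre\,T^{+}_{\varphi}$ as follows:

$$cre\,T^{+}_{\varphi} = \Big(W^{-}_{\neg\varphi},\ \Big\{\tfrac{pre(\delta)\,:\,just(\delta) \cup \{\neg p\}}{cons(\delta) \wedge \neg p} \ \Big|\ \delta \in D\Big\} \cup \Big\{\tfrac{true\,:\,\varphi \wedge p}{\varphi \wedge p}\Big\} \cup \Big\{\tfrac{true\,:\,\neg p}{\psi \wedge \neg p} \ \Big|\ \psi \in W - W^{-}_{\neg\varphi}\Big\}\Big),$$

where $p$ is a new atom and $-$ a theory base contraction operator.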



^1 In general, contraction of a formula $\alpha$ from a theory $T$ may result in the contraction
of other formulae, too. For example, if $\beta$ and $\beta \rightarrow \alpha$ are included in $T$, then at least
one of them has to be removed.
Theorem 4. Let $T$ be a default theory over the signature (logical language) $\Sigma$,
and $\varphi$ a formula over $\Sigma$ which is not a contradiction. Then:

(a) $cre\,T^{+}_{\varphi} \mid\!\sim_{cr} \varphi$.

(b) $E$ is an extension of $T$ iff there is an extension $E'$ of $cre\,T^{+}_{\varphi}$ such that $p \notin E'$
and $E' / \Sigma = E$, where $/ \Sigma$ denotes restriction to the signature $\Sigma$.

(b) states that if we forget about the new symbol $p$, all extensions of $T$ are
preserved. Together with (a) it says that we have kept the old extensions and
just added a new one which contains $\varphi$.

This construction looks rather complicated. We note that there are cases in
which it can be simplified significantly, for example if we restrict attention
to normal default theories, or if we consider a semi-monotonic variant of
default logic, such as Constrained Default Logic or Justified Default Logic.
In those cases addition of the default $\frac{true : \varphi}{\varphi}$ (almost) does the work, since,
due to semi-monotonicity, once the default is applied it will ultimately lead to
an extension which contains $\varphi$. The only thing needed is to treat the facts
adequately, as done before.

In the rest of this subsection we present the definitions and results for normal 
default theories in standard default logic, but note that similar results can be 
derived for semi-monotonic variants of default logic without the restriction to 
normal theories. 



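Definition 5. Let $T = (W, D)$ be a normal default theory and $\varphi$ a formula.
Define an operator $cre\,T^{+}_{\varphi}$ as follows:

- If $\neg\varphi \notin Th(W)$ then $cre\,T^{+}_{\varphi} = (W, \{\frac{true : \varphi}{\varphi}\} \cup D)$.

- If $\neg\varphi \in Th(W)$ then $cre\,T^{+}_{\varphi} = (W^{-}_{\neg\varphi}, \{\frac{true : \varphi \wedge p}{\varphi \wedge p}\} \cup D \cup \{\frac{true : \psi \wedge \neg p}{\psi \wedge \neg p} \mid \psi \in W - W^{-}_{\neg\varphi}\})$,
where $p$ is a new atom and $-$ a theory base contraction operator.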



Theorem 6. Let $T$ be a default theory over the signature (logical language) $\Sigma$,
and $\varphi$ a formula over $\Sigma$ which is not a contradiction. Then:

(a) $cre\,T^{+}_{\varphi} \mid\!\sim_{cr} \varphi$.

(b) If $E$ is an extension of $T$ then there is an extension $E'$ of $cre\,T^{+}_{\varphi}$ such that
$E' / \Sigma \supseteq E$; if $\neg\varphi \in Th(W)$ then even $E' / \Sigma = E$.



The opposite direction of (b) is not necessarily true. Take $\varphi$ to be $q$, $\Sigma =
\{q, r\}$, and $T$ to consist of the fact $\neg q$ and the default $\frac{q : r}{r}$. $T$ has the single
extension $E = Th(\{\neg q\})$, while $cre\,T^{+}_{\varphi}$ has two extensions, $E'_1 = Th(\{\neg p, \neg q\})$
and $E'_2 = Th(\{p, q, r\})$. $E'_1 / \Sigma = E$, as predicted by Theorem 6, but $E'_2 / \Sigma =
Th(\{q, r\})$ is not an extension of the original theory $T$.

For normal default theories we can propose an alternative approach. In the 
beginning of this subsection we said that identifying a default which is guaran- 
teed to lead to an extension is nontrivial. In case semi-monotonicity is guaran- 
teed, any default which is initially applicable will do. 






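Definition 7. Let $T = (W, D)$ be a normal default theory, and $\varphi$ a
non-contradictory formula. Suppose that there is a default $\delta \in D$ such that (i) $\delta$
is applicable to $Th(W)$; and (ii) $W \cup just(\delta) \cup \{\varphi\}$ is consistent. Then we define

$$cre\,T^{+}_{\varphi} = \Big(W,\ (D - \{\delta\}) \cup \Big\{\tfrac{pre(\delta)\,:\,just(\delta) \wedge \varphi}{cons(\delta) \wedge \varphi}\Big\}\Big).$$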

Theorem 8. Suppose that the conditions of the previous definition hold. Then:

(a) $cre\,T^{+}_{\varphi} \mid\!\sim_{cr} \varphi$.

(b) Every extension $E$ of $T$, such that $\delta$ is not applicable to $E$, is also an
extension of $cre\,T^{+}_{\varphi}$.

(c) For every extension $E$ of $T$ such that $\delta$ is applicable to $E$, $Th(E \cup \{\varphi\})$ is included
in an extension of $cre\,T^{+}_{\varphi}$.

3.3 Contraction - Removing a Formula From All Extensions

Of course we could remove $\varphi$ by adding $\neg\varphi$ to the set of facts, but that is not
what we wish to achieve: our aim is to delete $\varphi$ without adding $\neg\varphi$.

Here is the approach we propose. First we need to remove $\varphi$ from the facts,
using a classical theory base contraction strategy.

Then we need to make sure that $\varphi$ is not reintroduced by the application
of some defaults. One way of ensuring this is to add the default $\frac{\varphi : true}{false}$, which
destroys (that is, turns to failed) any process which includes $\varphi$ in its In-set.
This approach is not unappealing from the viewpoint of maintainability and
naturalness: all defaults are kept as they are, and a new one is added which has
a clear meaning (it resembles the use of the “cut/fail” combination in logic
programming).

Definition 9. Let $T = (W, D)$ be a default theory, and $\varphi$ a formula. We define:
$cre\,T^{-}_{\varphi} = (W^{-}_{\varphi}, D \cup \{\frac{\varphi : true}{false}\})$.

Theorem 10. Let $T$ be a default theory, and $\varphi$ a non-tautological formula.
Then:

(a) $cre\,T^{-}_{\varphi} \not\mid\!\sim_{cr} \varphi$.

(b) If $E$ is an extension of $T$ with $\varphi \notin E$, then $E$ is also an extension of $cre\,T^{-}_{\varphi}$.

Alternatively, one may add $\neg\varphi$ to the justifications of all defaults in $D$.



3.4 Contraction - Removing a Formula From At Least One
Extension 

This task causes similar problems to the revision case. We propose instead to do
something different, which has formally the right effect: we add a new extension
which does not include $\varphi$. This way $\varphi$ is not contained in at least one extension
of the new default theory. Thus we achieve, at least, that $\varphi$ is not included in
the sceptical interpretation of the default theory at hand.
Definition 11. Let $T = (W, D)$ be a default theory and $\varphi$ a formula. Define

I <5 e D} U 1^}), 

where p is a new atom. 

Theorem 12. Let $\Sigma$ be a signature, $T$ a default theory over $\Sigma$, and $\varphi$ a formula
over $\Sigma$. Then:

(a) $T^{-}_{\varphi} \not\mid\!\sim_{sc} \varphi$.

(b) $E$ is an extension of $T$ iff there is an extension $E'$ of $T^{-}_{\varphi}$ such that $p \notin E'$
and $E' / \Sigma = E$.

Thus if we forget about the new symbol $p$, all extensions of $T$ are preserved,
but there is a new one which does not contain $\varphi$.

4 Dynamic Operators for Preconstrained Theories 

We mentioned already the asymmetry inherent to approaches such as those 
described above; it is caused by the asymmetry of the default theories themselves: 
it is possible to express which formulae should be included in all extensions, 
but not which formulae to exclude. This asymmetry is removed when we study 
preconstrained default theories. New possibilities arise for the contraction and 
revision of theories, and we will describe some of them below. But first we make 
the following remark: if we manipulate facts and constraints only and leave the 
defaults unaltered, then revision and contraction operators will achieve results 
for all extensions. Thus in the following we will restrict our attention to the 
addition of a formula to all extensions, and the removal of a formula from all 
extensions. 



4.1 Revision 

We define a revision operator as follows: 

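Definition 13. $(W, D, C)^{+}_{\varphi}$ is a triple $(W', D, C')$ such that:

(a) $W' \subseteq W \cup \{\varphi\}$, and for all $W'' \subseteq W \cup \{\varphi\}$, if $W'' \supset W'$ then $Th(W'') \cap C' \neq \emptyset$.

(b) $C' \subseteq C$, and for all $C'' \subseteq C$, if $C'' \supset C'$ then $Th(W') \cap C'' \neq \emptyset$.

(c) $\varphi \in W'$.

In other words, $W'$ must include $\varphi$, and $W'$ and $C'$ are maximal such that
the property $Th(W') \cap C' = \emptyset$ is preserved.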

Theorem 14. Let $\varphi$ be satisfiable.

(a) $\varphi$ is included in all extensions of $(W, D, C)^{+}_{\varphi}$.

(b) If $E$ is an extension of $(W, D, C)$ and $\varphi \in E$, then $E$ is also an extension of
$(W, D, C)^{+}_{\varphi}$.
We note that this approach is different from building the revision on facts
based on classical belief revision. For example, consider the theory $(\{p \rightarrow q\}, D,
\{q\})$, and suppose that we wish to add $p$. The approach we propose admits a
solution, $(\{p\}, D, \{q\})$, which is not admitted by the pure belief revision
approach on the set of facts. Negative information has the same weight as positive
information!
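The example can be checked mechanically. The sketch below enumerates, by brute force, all pairs $(W', C')$ that satisfy conditions (a) to (c) of Definition 13 for a finite propositional theory. It is only an illustration under assumptions: formulae are identified by hypothetical string names, their meaning is supplied through a dictionary `sem` of Python functions, and entailment is decided by truth tables. The defaults $D$ play no role in the conditions and are therefore omitted.

```python
from itertools import combinations, product


def entails(formulas, goal, atoms):
    """Truth-table test for membership of `goal` in Th(formulas)."""
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if all(f(v) for f in formulas) and not goal(v):
            return False
    return True


def subsets(names):
    """All subsets of a collection of formula names, as frozensets."""
    names = sorted(names)
    for r in range(len(names) + 1):
        for combo in combinations(names, r):
            yield frozenset(combo)


def clash(w_names, c_names, sem, atoms):
    """True iff Th(W') and C' intersect, i.e. some constraint is entailed."""
    facts = [sem[n] for n in w_names]
    return any(entails(facts, sem[c], atoms) for c in c_names)


def revise(W, C, phi, sem, atoms):
    """All (W', C') meeting conditions (a)-(c) of the revision definition."""
    pool = frozenset(W) | {phi}
    solutions = []
    for Wp in subsets(pool):
        if phi not in Wp:                       # condition (c)
            continue
        for Cp in subsets(C):
            if clash(Wp, Cp, sem, atoms):       # need Th(W') and C' disjoint
                continue
            # (a): every proper superset of W' within the pool clashes with C'
            w_max = all(clash(W2, Cp, sem, atoms)
                        for W2 in subsets(pool) if W2 > Wp)
            # (b): every proper superset of C' within C clashes with W'
            c_max = all(clash(Wp, C2, sem, atoms)
                        for C2 in subsets(C) if C2 > Cp)
            if w_max and c_max:
                solutions.append((Wp, Cp))
    return solutions


# The theory ({p -> q}, D, {q}) from the text, revised with p.
atoms = ("p", "q")
sem = {
    "p": lambda v: v["p"],
    "q": lambda v: v["q"],
    "p->q": lambda v: (not v["p"]) or v["q"],
}
solutions = revise({"p->q"}, {"q"}, "p", sem, atoms)
```

For this input the sketch returns two solutions: one keeps the constraint and gives up the implication, which is the solution $(\{p\}, D, \{q\})$ mentioned in the text, and the other keeps both facts and gives up the constraint.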

In the approach described above any facts retracted are lost forever. But 
it is possible to store them as defaults, which cannot be applied at present, 
but could become applicable later after further revisions of the default theory 
(which, for example, may cause $\varphi$ to be retracted). Thus the following approach
may have advantages for iterated revision. But note that this way we reintroduce 
an asymmetry, albeit of a different kind: whereas retracted facts can be kept as 
defaults, retracted constraints are lost. 

Definition 15. (W,D,C)^^ is a set (W , D,C') such that properties (a), (b) 
and (c) from the previous definition and the following condition hold: 

(d) = : V-G (VFU{p})-W'}. 

Theorem 16. (W, 11,(7) and (1F,Z1,(7)+^ have the same extensions. 



4.2 Contraction 

Here we reap the benefits which derive from the symmetry between positive
and negative information: contraction of $\varphi$ is achieved simply by adding $\varphi$ to
the constraints (followed, as before, by the reduction of the sets of facts and
constraints in such a way that the consistency condition $Th(W) \cap C = \emptyset$ is
satisfied).

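Definition 17. $(W, D, C)^{-}_{\varphi} = (W', D, C')$ such that:

(a) $W' \subseteq W$, and for all $W'' \subseteq W$, if $W'' \supset W'$ then $Th(W'') \cap C' \neq \emptyset$.

(b) $C' \subseteq C \cup \{\varphi\}$, and for all $C'' \subseteq C \cup \{\varphi\}$, if $C'' \supset C'$ then $Th(W') \cap C'' \neq \emptyset$.

(c) $\varphi \in C'$.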

Theorem 18. Let $\varphi$ be non-tautological.

(a) $\varphi$ is not included in any extension of $(W, D, C)^{-}_{\varphi}$.

(b) If $E$ is an extension of $(W, D, C)$ and $\varphi \notin E$, then $E$ is an extension of
$(W, D, C)^{-}_{\varphi}$.
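Contraction for preconstrained theories can be sketched in the same brute-force style, now placing the formula among the constraints. As before this is an illustration under assumptions rather than the paper's implementation: formula names, the `sem` dictionary and the helper functions are hypothetical, entailment is decided by truth tables, and the defaults are omitted because conditions (a) to (c) of the contraction definition do not mention them.

```python
from itertools import combinations, product


def entails(formulas, goal, atoms):
    """Truth-table test for membership of `goal` in Th(formulas)."""
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if all(f(v) for f in formulas) and not goal(v):
            return False
    return True


def subsets(names):
    """All subsets of a collection of formula names, as frozensets."""
    names = sorted(names)
    for r in range(len(names) + 1):
        for combo in combinations(names, r):
            yield frozenset(combo)


def clash(w_names, c_names, sem, atoms):
    """True iff some constraint in C' is entailed by the facts in W'."""
    facts = [sem[n] for n in w_names]
    return any(entails(facts, sem[c], atoms) for c in c_names)


def contract(W, C, phi, sem, atoms):
    """All (W', C') meeting conditions (a)-(c) of the contraction definition."""
    c_pool = frozenset(C) | {phi}
    solutions = []
    for Wp in subsets(W):
        for Cp in subsets(c_pool):
            if phi not in Cp:                   # condition (c)
                continue
            if clash(Wp, Cp, sem, atoms):       # need Th(W') and C' disjoint
                continue
            # (a): no proper superset of W' within W avoids a clash with C'
            w_max = all(clash(W2, Cp, sem, atoms)
                        for W2 in subsets(W) if W2 > Wp)
            # (b): no proper superset of C' within the pool avoids a clash
            c_max = all(clash(Wp, C2, sem, atoms)
                        for C2 in subsets(c_pool) if C2 > Cp)
            if w_max and c_max:
                solutions.append((Wp, Cp))
    return solutions


# Hypothetical example: facts {p, p -> q}, no constraints, contract q.
atoms = ("p", "q")
sem = {
    "p": lambda v: v["p"],
    "q": lambda v: v["q"],
    "p->q": lambda v: (not v["p"]) or v["q"],
}
solutions = contract({"p", "p->q"}, set(), "q", sem, atoms)
```

In this example `q` becomes a constraint, and one of the two facts has to go, so the sketch reports two admissible outcomes, one for each retained fact.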



A variant for iterated change is possible as in the case of revision. 
5 Conclusion

In this paper we defined a problem that has attracted little attention so far: 
the revision of default theories. We believe that belief revision techniques should
move beyond classical representations and consider richer knowledge
structures.

We considered the two main operators of revising knowledge bases, revision 
(addition of information) and contraction (deletion of information), and did this 
in the two main approaches to default reasoning: the sceptical case and the 
credulous case. In most cases we gave more than one possible solution, and
discussed their respective advantages and disadvantages. 

In a companion paper we provide a deeper study of the revision of precon- 
strained default theories. 

Two specific techniques have already been implemented in a system.
The focus of that work was how to make the change without having to
recalculate the extensions of the revised default theory from scratch. That work
has shown that considerable computational effort can be avoided by carrying over
part of the computation from the old to the new theory. This paper is orthogonal
to it in the sense that the techniques developed there will be used to save
computational effort when the revision methods of this paper are implemented.

We intend to study other approaches of revising default theories, beginning 
from desirable abstract postulates and moving towards concrete realizations that 
satisfy them. Such an approach may require more computational effort, but is 
more in line with classical belief revision. 
Abstract. In this paper we develop frameworks for logical systems
which are able to reflect not only nonmonotonic patterns of reason- 
ing, but also paraconsistent reasoning. For this we consider a sequence 
of generalizations of the pioneering works of Gabbay, Kraus, Lehmann, 
Magidor and Makinson. Our sequence of frameworks culminates in what 
we call plausible, nonmonotonic, multiple-conclusion consequence rela- 
tions (which are based on a given monotonic one). Our study yields 
intuitive justifications for conditions that have been proposed in previ- 
ous frameworks, and also clarifies the connections among some of these 
systems. In addition, we present a general method for constructing plau- 
sible nonmonotonic relations. This method is based on a multiple-valued 
semantics, and on Shoham’s idea of preferential models. 



1 Introduction 

Our main goal in this paper is to get a better understanding of the conditions 
that a useful relation for nonmonotonic and paraconsistent [5] reasoning should 
satisfy. For this we consider a sequence of generalizations of the pioneering works
of Gabbay [7], Kraus, Lehmann, Magidor [8] and Makinson [12]. These general- 
izations are based on the following ideas: 

— Each nonmonotonic logical system is based on some underlying monotonic 
one. 

— The underlying monotonic logic should not necessarily be classical logic, 
but should be chosen according to the intended application. If, for example, 
inconsistent data is not to be totally rejected, then an underlying paracon- 
sistent logic might be a better choice than classical logic. 

— The more significant logical properties of the main connectives of the under- 
lying monotonic logic, especially conjunction and disjunction (which have 
crucial roles in monotonic consequence relations) , should be preserved as far 
as possible. 

— On the other hand, the conditions that define a certain class of nonmonotonic 
systems should not assume anything concerning the language of the system 
(in particular, the existence of appropriate conjunction or disjunction should 
not be assumed). 



A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 11-21, 1999.
© Springer-Verlag Berlin Heidelberg 1999 
Our sequence of generalizations culminates in what we call (following Lehmann
[9]) cautious plausible consequence relations (which are based on a given
monotonic one).^1

Our study yields intuitive justifications for conditions that have been pro- 
posed in previous frameworks and also clarifies the connections among some of 
these systems. Moreover, while the logic behind most of the systems which were 
proposed so far is supraclassical (i.e., every first-degree inference rule that is 
classically sound remains valid in the resulting logics) , the consequence relations 
considered here are also capable of drawing conclusions from incomplete and 
inconsistent theories in a nontrivial way. 

In the last part of this paper we present a general method for constructing 
such plausible nonmonotonic and paraconsistent relations. This method is based 
on a multiple-valued semantics, and on Shoham’s idea of preferential models 
[21].^2



2 General Background 



We first briefly review the original treatments of [8] and [12]. The language they
use is based on the standard propositional one. Here, $\rightarrow$ denotes the material
implication and $\equiv$ denotes the corresponding equivalence operator. The
classical propositional language, with the connectives $\neg, \vee, \wedge, \rightarrow, \equiv$, and with a
propositional constant $t$, is denoted here by $\Sigma_{cl}$.



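Definition 1. [8] Let $\vdash_{cl}$ be the classical consequence relation. A binary
relation^3 $\mid\!\sim'$ between formulae in $\Sigma_{cl}$ is called cumulative if it is closed under the
following inference rules:

reflexivity: $\psi \mid\!\sim' \psi$.
cautious monotonicity: if $\psi \mid\!\sim' \phi$ and $\psi \mid\!\sim' \tau$, then $\psi \wedge \phi \mid\!\sim' \tau$.
cautious cut: if $\psi \mid\!\sim' \phi$ and $\psi \wedge \phi \mid\!\sim' \tau$, then $\psi \mid\!\sim' \tau$.
left logical equivalence: if $\vdash_{cl} \psi \equiv \phi$ and $\psi \mid\!\sim' \tau$, then $\phi \mid\!\sim' \tau$.
right weakening: if $\vdash_{cl} \psi \rightarrow \phi$ and $\tau \mid\!\sim' \psi$, then $\tau \mid\!\sim' \phi$.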



Definition 2. [8] A cumulative relation $\mid\!\sim'$ is called preferential if it is closed
under the following rule:

left $\vee$-introduction (Or): if $\psi \mid\!\sim' \tau$ and $\phi \mid\!\sim' \tau$, then $\psi \vee \phi \mid\!\sim' \tau$.



The conditions above might look a little ad hoc. For example, one might
ask why $\rightarrow$ is used on the right, while the stronger $\equiv$ is used on the left. A

^1 See [4] for another formalism with non-classical monotonic consequence relations as
the basis for nonmonotonic consequence relations.

^2 Due to a lack of space, proofs of propositions are omitted; they will be given in the
full version of the paper.

^3 A “conditional assertion” in terms of [8].
discussion and some justification appear in [8, 11]. A stronger intuitive
justification will be given below, using more general frameworks.



In what follows we consider several generalizations of the basic relations 
presented above: 

1. Allowing the use of nonclassical logics (for example: paraconsistent logics) 
as the basis, instead of classical logic. 

2. Allowing the use of a set of premises rather than a single one. 

3. Allowing the use of multiple conclusions relations rather than single conclu- 
sion ones. 



The key logical concepts which stand behind these generalizations are the 
following: 



Definition 3. 

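a) An ordinary Tarskian consequence relation (tcr, for short) [22] is a binary
relation $\vdash$ between sets of formulae and formulae that satisfies the following
conditions:^5

s-TR strong T-reflexivity: $\Gamma \vdash \psi$ for every $\psi \in \Gamma$.

TM T-monotonicity: if $\Gamma \vdash \psi$ and $\Gamma \subseteq \Gamma'$ then $\Gamma' \vdash \psi$.

TC T-cut: if $\Gamma_1 \vdash \psi$ and $\Gamma_2, \psi \vdash \phi$ then $\Gamma_1, \Gamma_2 \vdash \phi$.

b) An ordinary Scott consequence relation (scr, for short) [19,20] is a binary
relation $\vdash$ between sets of formulae that satisfies the conditions below:

s-R strong reflexivity: if $\Gamma \cap \Delta \neq \emptyset$ then $\Gamma \vdash \Delta$.

M monotonicity: if $\Gamma \vdash \Delta$ and $\Gamma \subseteq \Gamma'$, $\Delta \subseteq \Delta'$, then $\Gamma' \vdash \Delta'$.

C cut: if $\Gamma_1 \vdash \psi, \Delta_1$ and $\Gamma_2, \psi \vdash \Delta_2$ then $\Gamma_1, \Gamma_2 \vdash \Delta_1, \Delta_2$.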



Definition 4. 

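a) Let $\mid\!\sim$ be a relation between sets of formulae.

• A connective $\wedge$ is called internal conjunction (w.r.t. $\mid\!\sim$) if:

$[\wedge \mid\!\sim]_I \quad \frac{\Gamma, \psi, \phi \,\mid\!\sim\, \Delta}{\Gamma, \psi \wedge \phi \,\mid\!\sim\, \Delta} \qquad [\wedge \mid\!\sim]_E \quad \frac{\Gamma, \psi \wedge \phi \,\mid\!\sim\, \Delta}{\Gamma, \psi, \phi \,\mid\!\sim\, \Delta}$

• A connective $\wedge$ is called combining conjunction (w.r.t. $\mid\!\sim$) if:

$[\mid\!\sim \wedge]_I \quad \frac{\Gamma \,\mid\!\sim\, \psi, \Delta \qquad \Gamma \,\mid\!\sim\, \phi, \Delta}{\Gamma \,\mid\!\sim\, \psi \wedge \phi, \Delta} \qquad [\mid\!\sim \wedge]_E \quad \frac{\Gamma \,\mid\!\sim\, \psi \wedge \phi, \Delta}{\Gamma \,\mid\!\sim\, \psi, \Delta} \quad \frac{\Gamma \,\mid\!\sim\, \psi \wedge \phi, \Delta}{\Gamma \,\mid\!\sim\, \phi, \Delta}$

• A connective $\vee$ is called internal disjunction (w.r.t. $\mid\!\sim$) if:

$[\mid\!\sim \vee]_I \quad \frac{\Gamma \,\mid\!\sim\, \psi, \phi, \Delta}{\Gamma \,\mid\!\sim\, \psi \vee \phi, \Delta} \qquad [\mid\!\sim \vee]_E \quad \frac{\Gamma \,\mid\!\sim\, \psi \vee \phi, \Delta}{\Gamma \,\mid\!\sim\, \psi, \phi, \Delta}$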


Systems that satisfy the conditions of Definitions 1, 2, as well as other related sys- 
tems, are also considered in [6, 13, 18, 10]. 

^5 The prefix “T” reflects the fact that these are Tarskian rules.
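• A connective $\vee$ is called combining disjunction (w.r.t. $\mid\!\sim$) if:

$[\vee \mid\!\sim]_I \quad \frac{\Gamma, \psi \,\mid\!\sim\, \Delta \qquad \Gamma, \phi \,\mid\!\sim\, \Delta}{\Gamma, \psi \vee \phi \,\mid\!\sim\, \Delta} \qquad [\vee \mid\!\sim]_E \quad \frac{\Gamma, \psi \vee \phi \,\mid\!\sim\, \Delta}{\Gamma, \psi \,\mid\!\sim\, \Delta} \quad \frac{\Gamma, \psi \vee \phi \,\mid\!\sim\, \Delta}{\Gamma, \phi \,\mid\!\sim\, \Delta}$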

b) Let $\mid\!\sim$ be a relation between sets of formulae and formulae. The notions
of combining conjunction, internal conjunction, and combining disjunction are
defined for $\mid\!\sim$ exactly like in case (a).

Note: If $\vdash$ is an scr (tcr) then $\wedge$ is an internal conjunction for $\vdash$ iff it is a
combining conjunction for $\vdash$. The same is true for $\vee$ in case $\vdash$ is an scr. This,
however, is not true in general.



3 Tarskian Cautious Consequence Relations 

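Definition 5. A Tarskian cautious consequence relation (tccr, for short) is a
binary relation $|\!\sim$ between sets of formulae and formulae in a language
$\Sigma$ that satisfies the following conditions:®

s-TR strong T-reflexivity: $\Gamma \;|\!\sim\; \psi$ for every $\psi \in \Gamma$.

TCM T-cautious monotonicity: if $\Gamma \;|\!\sim\; \psi$ and $\Gamma \;|\!\sim\; \phi$, then $\Gamma, \psi \;|\!\sim\; \phi$.

TCC T-cautious cut: if $\Gamma \;|\!\sim\; \psi$ and $\Gamma, \psi \;|\!\sim\; \phi$, then $\Gamma \;|\!\sim\; \phi$.

Proposition 6. Any tccr is closed under the following rules for every n:

TCM$^{[n]}$ if $\Gamma \;|\!\sim\; \psi_i$ $(i = 1, \ldots, n)$ then $\Gamma, \psi_1, \ldots, \psi_{n-1} \;|\!\sim\; \psi_n$.

TCC$^{[n]}$ if $\Gamma \;|\!\sim\; \psi_i$ $(i = 1, \ldots, n)$ and $\Gamma, \psi_1, \ldots, \psi_n \;|\!\sim\; \phi$ then $\Gamma \;|\!\sim\; \phi$.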

We now generalize the notion of a cumulative entailment relation. We first
do it for Tarskian consequence relations $\vdash$ that have an internal
conjunction $\wedge$.

Definition 7. A tccr $|\!\sim$ is called $\{\wedge, \vdash\}$-cumulative if it
satisfies the following conditions:

• if $\psi \vdash \phi$ and $\phi \vdash \psi$ and $\psi \;|\!\sim\; \tau$ then $\phi \;|\!\sim\; \tau$. (weak left logical equivalence)

• if $\psi \vdash \phi$ and $\tau \;|\!\sim\; \psi$, then $\tau \;|\!\sim\; \phi$. (weak right weakening)

• $\wedge$ is also an internal conjunction w.r.t. $|\!\sim$.

If, in addition, $\vdash$ has a combining disjunction $\vee$, then $|\!\sim$ is
called $\{\vee, \wedge, \vdash\}$-preferential if it also satisfies the
single-conclusion version of $[\vee |\!\sim]_I$.

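Proposition 8. Suppose $|\!\sim$ is $\vdash_{cl}$-cumulative [$\vdash_{cl}$-preferential].
Let $\psi \;|\!\sim'\; \phi$ iff $\{\psi\} \;|\!\sim\; \phi$. Then w.r.t. $\vdash_{cl}$,
$|\!\sim'$ is cumulative [preferential] in the sense of [8].

Conversely: if $|\!\sim'$ is cumulative [preferential] in the sense of [8] and we
define $\psi_1, \ldots, \psi_n \;|\!\sim\; \phi$ iff $\psi_1 \wedge \ldots \wedge \psi_n \;|\!\sim'\; \phi$,
then $|\!\sim$ is $\vdash_{cl}$-cumulative [$\vdash_{cl}$-preferential].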



® This set of conditions was first proposed in [7].

We next generalize the definition of a cumulative tccr to make it independent
of the existence of an internal conjunction.

Proposition 9. Let $\vdash$ be a tcr, and let $|\!\sim$ be a tccr in the same
language. The following connections between $\vdash$ and $|\!\sim$ are equivalent:

TCum T-cumulativity: for every $\Gamma \neq \emptyset$, if $\Gamma \vdash \psi$ then $\Gamma \;|\!\sim\; \psi$.

TLLE T-left logical equivalence: if $\Gamma, \psi \vdash \phi$ and $\Gamma, \phi \vdash \psi$ and $\Gamma, \psi \;|\!\sim\; \tau$, then $\Gamma, \phi \;|\!\sim\; \tau$.

TRW T-right weakening: if $\Gamma, \psi \vdash \phi$ and $\Gamma \;|\!\sim\; \psi$, then $\Gamma \;|\!\sim\; \phi$.

TMiC T-mixed cut: for every $\Gamma \neq \emptyset$, if $\Gamma \vdash \psi$ and $\Gamma, \psi \;|\!\sim\; \phi$ then $\Gamma \;|\!\sim\; \phi$.

Definition 10. Let $\vdash$ be a tcr. A tccr $|\!\sim$ in the same language is
called $\vdash$-cumulative if it satisfies any of the conditions of Proposition 9.
If $\vdash$ has a combining disjunction $\vee$, and $|\!\sim$ satisfies
$[\vee |\!\sim]_I$, then $|\!\sim$ is called $\{\vee, \vdash\}$-preferential.

Note: Since $\Gamma \vdash \psi$ for every $\psi \in \Gamma$, TCum implies s-TR,
and so a binary relation that satisfies TCum, TCM, and TCC is a
$\vdash$-cumulative tccr.

Proposition 11. Suppose that $\vdash$ is a tcr with an internal conjunction
$\wedge$. A tccr $|\!\sim$ is $\{\wedge, \vdash\}$-cumulative iff it is
$\vdash$-cumulative. If $\vdash$ also has a combining disjunction $\vee$, then
$|\!\sim$ is $\{\vee, \wedge, \vdash\}$-preferential iff it is
$\{\vee, \vdash\}$-preferential.

Proposition 12. Let $|\!\sim$ be a $\vdash$-cumulative relation, and let $\wedge$
be an internal conjunction w.r.t. $\vdash$. Then $\wedge$ is both an internal
conjunction and a combining conjunction w.r.t. $|\!\sim$.

4 Scott Cautious Consequence Relations 

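Definition 13. A Scott cautious consequence relation (sccr, for short) is a
binary relation $|\!\sim$ between nonempty^ sets of formulae that satisfies the
following conditions:

s-R strong reflexivity: if $\Gamma \cap \Delta \neq \emptyset$ then $\Gamma \;|\!\sim\; \Delta$.

CM cautious monotonicity: if $\Gamma \;|\!\sim\; \psi$ and $\Gamma \;|\!\sim\; \Delta$ then $\Gamma, \psi \;|\!\sim\; \Delta$.

CC$^{[1]}$ cautious 1-cut: if $\Gamma \;|\!\sim\; \psi$ and $\Gamma, \psi \;|\!\sim\; \Delta$ then $\Gamma \;|\!\sim\; \Delta$.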

A natural requirement from a Scott cumulative consequence relation is that 
its single-conclusion counterpart will be a Tarskian cumulative consequence re- 
lation. Such a relation should also use disjunction on the r.h.s. like it uses con- 
junction on the l.h.s. The following definition formalizes these requirements. 

Definition 14. Let $\vdash$ be an scr with an internal disjunction $\vee$. A
relation $|\!\sim$ between nonempty finite sets of formulae is called a
$\{\vee, \vdash\}$-cumulative sccr if it is an sccr that satisfies the following
two conditions:

a) Let $\vdash_T$ and $|\!\sim_T$ be, respectively, the single-conclusion
counterparts of $\vdash$ and $|\!\sim$. Then $|\!\sim_T$ is a
$\vdash_T$-cumulative tccr.

b) $\vee$ is an internal disjunction w.r.t. $|\!\sim$ as well.

^ The condition of non-emptiness is just technically convenient here. It is
possible to remove it at the expense of somewhat complicating the definitions
and propositions. It is preferable instead to employ (whenever necessary) the
propositional constants t and f to represent the empty l.h.s. and the empty
r.h.s., respectively.

Following the line of what we have done in the previous section, we next spec- 
ify conditions that are equivalent to those of Definition 14, but are independent 
of the existence of any specific connective in the language. 

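Definition 15. Let $\vdash$ be an scr. An sccr $|\!\sim$ in the same language is
called weakly $\vdash$-cumulative if it satisfies the following conditions:

Cum cumulativity: if $\Gamma, \Delta \neq \emptyset$ and $\Gamma \vdash \Delta$, then $\Gamma \;|\!\sim\; \Delta$.

RW$^{[1]}$ right weakening: if $\Gamma, \psi \vdash \phi$ and $\Gamma \;|\!\sim\; \psi, \Delta$, then $\Gamma \;|\!\sim\; \phi, \Delta$.

RM right monotonicity: if $\Gamma \;|\!\sim\; \Delta$ then $\Gamma \;|\!\sim\; \psi, \Delta$.

Proposition 16. Let $\vdash$ and $\vee$ be as in Definition 14. A relation
$|\!\sim$ is a $\{\vee, \vdash\}$-cumulative sccr iff it is a weakly
$\vdash$-cumulative sccr.

Proposition 17. If $\vdash$ has an internal disjunction, then $|\!\sim$ is a
weakly $\vdash$-cumulative sccr if it satisfies Cum, CM, CC$^{[1]}$, and RW$^{[1]}$.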

We turn now to examine the role of conjunction in the present context. 

Proposition 18. Let $\vdash$ be an scr with an internal conjunction $\wedge$,
and let $|\!\sim$ be a weakly $\vdash$-cumulative sccr. Then:

a) $\wedge$ is an internal conjunction w.r.t. $|\!\sim$.

b) $\wedge$ is a “half” combining conjunction w.r.t. $|\!\sim$. I.e., it satisfies $[|\!\sim \wedge]_E$.

Definition 19. Suppose that an scr $\vdash$ has an internal conjunction
$\wedge$. A weakly $\vdash$-cumulative sccr $|\!\sim$ is called
$\{\wedge, \vdash\}$-cumulative if $\wedge$ is also a combining conjunction
w.r.t. $|\!\sim$.

As usual, we provide an equivalent notion in which one does not have to
assume that an internal conjunction is available:

Definition 20. A weakly $\vdash$-cumulative sccr $|\!\sim$ is called
$\vdash$-cumulative if for every finite n the following condition is satisfied:

RW$^{[n]}$ if $\Gamma \;|\!\sim\; \psi_i, \Delta$ $(i = 1, \ldots, n)$ and $\Gamma, \psi_1, \ldots, \psi_n \vdash \phi$ then $\Gamma \;|\!\sim\; \phi, \Delta$.

Proposition 21. Let $\wedge$ be an internal conjunction for $\vdash$. An sccr
$|\!\sim$ is $\{\wedge, \vdash\}$-cumulative iff it is $\vdash$-cumulative.

Corollary 22. If $\vdash$ is an scr with an internal conjunction $\wedge$ and
$|\!\sim$ is a $\vdash$-cumulative sccr, then $\wedge$ is a combining conjunction
and an internal conjunction w.r.t. $|\!\sim$.

Let us return now to disjunction, examining it this time from its combining
aspect. Our first observation is that, unlike conjunction, one direction of the
combining disjunction property for $|\!\sim$ of $\vee$ yields monotonicity of
$|\!\sim$:

Lemma 23. Suppose that $\vee$ is an internal disjunction for $\vdash$ and
$|\!\sim$ is a weakly $\vdash$-cumulative sccr in which $[\vee |\!\sim]_E$ is
satisfied. Then $|\!\sim$ is (left) monotonic.

It follows that requiring $[\vee |\!\sim]_E$ from a weakly $\vdash$-cumulative
sccr is too strong. It is reasonable, however, to require its converse.

Definition 24. A weakly $\vdash$-cumulative sccr is called weakly
$\{\vee, \vdash\}$-preferential if it satisfies $[\vee |\!\sim]_I$.

Unlike the Tarskian case, this time we are able to provide an equivalent
condition in which one does not have to assume that a disjunction is available:

Definition 25. Let $\vdash$ be an scr. A weakly $\vdash$-cumulative sccr is
called weakly $\vdash$-preferential if it satisfies the following rule:

CC cautious cut: if $\Gamma \;|\!\sim\; \psi, \Delta$ and $\Gamma, \psi \;|\!\sim\; \Delta$ then $\Gamma \;|\!\sim\; \Delta$.

Proposition 26. Let $\vdash$ be an scr with an internal disjunction $\vee$. An
sccr $|\!\sim$ is weakly $\{\vee, \vdash\}$-preferential iff it is weakly
$\vdash$-preferential.

Some characterizations of weak h-preferentiality are given in the following 
proposition: 

Proposition 27. Let $\vdash$ be an scr.

a) $|\!\sim$ is a weakly $\vdash$-preferential sccr iff it satisfies Cum, CM, CC, and RM.

b) $|\!\sim$ is a weakly $\vdash$-preferential sccr iff it is a weakly
$\vdash$-cumulative sccr and for every finite n it satisfies cautious n-cut:

CC$^{[n]}$ if $\Gamma \;|\!\sim\; \psi_i, \Delta$ $(i = 1, \ldots, n)$ and $\Gamma, \psi_1, \ldots, \psi_n \;|\!\sim\; \Delta$, then $\Gamma \;|\!\sim\; \Delta$.

Note: By Proposition 6, the single-conclusion counterpart of CC$^{[n]}$ is valid
for any sccr (not only the cumulative or preferential ones).

We are now ready to introduce our strongest notions of nonmonotonic Scott 
consequence relation: 

Definition 28. Let $\vdash$ be an scr. A relation $|\!\sim$ is called
$\vdash$-preferential iff it is both $\vdash$-cumulative and weakly
$\vdash$-preferential.

Proposition 29. Let $\vdash$ be an scr. $|\!\sim$ is $\vdash$-preferential iff it
satisfies Cum, CM, CC, RM, and RW$^{[n]}$ for every n.

Proposition 30. Let $\vdash$ be an scr and let $|\!\sim$ be a
$\vdash$-preferential sccr.

a) An internal conjunction $\wedge$ w.r.t. $\vdash$ is also an internal
conjunction and a combining conjunction w.r.t. $|\!\sim$.

b) An internal disjunction $\vee$ w.r.t. $\vdash$ is also an internal
disjunction and a “half” combining disjunction w.r.t. $|\!\sim$.®



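® I.e., $|\!\sim$ satisfies $[\vee |\!\sim]_I$ (but not necessarily $[\vee |\!\sim]_E$).

CC$^{[n]}$ $(n \geq 1)$, which is valid for $\vdash$-preferential sccrs, is a
natural generalization of cautious cut. A dual generalization, which seems
equally natural, is given in the following rule from [9]:

LCC$^{[n]}$: $\dfrac{\Gamma \;|\!\sim\; \psi_1, \ldots, \psi_n, \Delta \qquad \Gamma, \psi_1 \;|\!\sim\; \Delta \quad \ldots \quad \Gamma, \psi_n \;|\!\sim\; \Delta}{\Gamma \;|\!\sim\; \Delta}$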

Definition 31. [9] A binary relation $|\!\sim$ is a plausibility logic if it
satisfies Inclusion ($\Gamma, \psi \;|\!\sim\; \psi$), CM, RM, and LCC$^{[n]}$ $(n \geq 1)$.

Definition 32. Let $\vdash$ be an scr. A relation $|\!\sim$ is called
$\vdash$-plausible if it is a $\vdash$-preferential sccr and a plausibility logic.

A more concise characterization of a $\vdash$-plausible relation is given in
the following proposition:

Proposition 33. Let $\vdash$ be an scr. A relation $|\!\sim$ is
$\vdash$-plausible iff it satisfies Cum, CM, RM, and LCC$^{[n]}$ for every n.

Proposition 34. Let $\vdash$ be an scr with an internal conjunction $\wedge$. A
relation $|\!\sim$ is $\vdash$-preferential iff it is $\vdash$-plausible.

Table 1 and Figure 1 summarize the various types of Scott relations considered
in this section and their relative strengths. $\vdash$ is assumed there to be
an scr, and $\vee$, $\wedge$ are internal disjunction and conjunction
(respectively) w.r.t. $\vdash$, whenever they are mentioned.



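Table 1. Scott relations

consequence relation                  | general conditions                     | valid conditions with $\wedge$ and $\vee$
sccr                                  | s-R, CM, CC$^{[1]}$                    | -
weakly $\vdash$-cumulative sccr       | Cum, CM, CC$^{[1]}$, RW$^{[1]}$, RM    | $[\wedge|\!\sim]_I$, $[\wedge|\!\sim]_E$, $[|\!\sim\wedge]_E$, $[|\!\sim\vee]_I$, $[|\!\sim\vee]_E$
$\vdash$-cumulative sccr              | Cum, CM, CC$^{[1]}$, RW$^{[n]}$, RM    | $[\wedge|\!\sim]_I$, $[\wedge|\!\sim]_E$, $[|\!\sim\wedge]_I$, $[|\!\sim\wedge]_E$, $[|\!\sim\vee]_I$, $[|\!\sim\vee]_E$
weakly $\vdash$-preferential sccr     | Cum, CM, CC, RM                        | $[\wedge|\!\sim]_I$, $[\wedge|\!\sim]_E$, $[|\!\sim\wedge]_E$, $[\vee|\!\sim]_I$, $[|\!\sim\vee]_I$, $[|\!\sim\vee]_E$
$\vdash$-preferential sccr            | Cum, CM, CC, RW$^{[n]}$, RM            | $[\wedge|\!\sim]_I$, $[\wedge|\!\sim]_E$, $[|\!\sim\wedge]_I$, $[|\!\sim\wedge]_E$, $[\vee|\!\sim]_I$, $[|\!\sim\vee]_I$, $[|\!\sim\vee]_E$
$\vdash$-plausible sccr               | Cum, CM, LCC$^{[n]}$, RM               | $[\wedge|\!\sim]_I$, $[\wedge|\!\sim]_E$, $[|\!\sim\wedge]_I$, $[|\!\sim\wedge]_E$, $[\vee|\!\sim]_I$, $[|\!\sim\vee]_I$, $[|\!\sim\vee]_E$
scr extending $\vdash$                | Cum, M, C                              | $[\wedge|\!\sim]_I$, $[\wedge|\!\sim]_E$, $[|\!\sim\wedge]_I$, $[|\!\sim\wedge]_E$, $[\vee|\!\sim]_I$, $[\vee|\!\sim]_E$, $[|\!\sim\vee]_I$, $[|\!\sim\vee]_E$

5 A Semantical Point of View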

In this section we present a general method of constructing nonmonotonic
consequence relations of the strongest type considered in the previous section,
i.e., preferential and plausible sccrs. Our approach is based on a
multiple-valued semantics. This will allow us to define in a natural way
consequence relations that are not only nonmonotonic, but also paraconsistent
(see examples below).

A basic idea behind our method is that of using a set of preferential models 
for making inferences. Preferential models were introduced by McCarthy [14] 
and later by Shoham [21] as a generalization of the notion of circumscription. 
The essential idea is that only a subset of its models should be relevant for 
making inferences from a given theory. These models are the most preferred 
ones according to some conditions or preference criteria. 

Definition 35. Let $\Sigma$ be an arbitrary propositional language. A
preferential multiple-valued structure for $\Sigma$ (pms, for short) is a
quadruple $(\mathcal{L}, \mathcal{F}, \mathcal{S}, \leq)$, where $\mathcal{L}$ is
a set of elements (“truth values”), $\mathcal{F}$ is a nonempty proper subset of
$\mathcal{L}$, $\mathcal{S}$ is a set of operations on $\mathcal{L}$ that
correspond to the connectives in $\Sigma$, and $\leq$ is a well-founded partial
order on $\mathcal{L}$.

The set $\mathcal{F}$ consists of the designated values of $\mathcal{L}$, i.e.,
those that represent true assertions. In what follows we shall assume that
$\mathcal{L}$ contains at least the classical values t, f, and that
$t \in \mathcal{F}$, $f \notin \mathcal{F}$.

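Definition 36. Let $(\mathcal{L}, \mathcal{F}, \mathcal{S}, \leq)$ be a pms.

a) A (multiple-valued) valuation $\nu$ is a function that assigns an element of
$\mathcal{L}$ to each atomic formula. Extensions to complex formulae are done as
usual.

b) A valuation $\nu$ satisfies a formula $\psi$ if $\nu(\psi) \in \mathcal{F}$.

c) A valuation $\nu$ is a model of a set $\Gamma$ of formulae if $\nu$ satisfies
every formula in $\Gamma$. The set of the models of $\Gamma$ is denoted by
$mod(\Gamma)$.

Definition 37. Let $(\mathcal{L}, \mathcal{F}, \mathcal{S}, \leq)$ be a pms.
Denote $\Gamma \vdash^{\mathcal{L},\mathcal{F}} \Delta$ if every model of
$\Gamma$ satisfies some formula in $\Delta$.

Proposition 38. $\vdash^{\mathcal{L},\mathcal{F}}$ is an scr.

Definition 39. Let $(\mathcal{L}, \mathcal{F}, \mathcal{S}, \leq)$ be a pms for a
language $\Sigma$.

a) An operator $\wedge \in \mathcal{S}$® is conjunctive if $\forall x, y \in \mathcal{L}$, $x \wedge y \in \mathcal{F}$ iff $x \in \mathcal{F}$ and $y \in \mathcal{F}$.

b) An operator $\vee \in \mathcal{S}$ is disjunctive if $\forall x, y \in \mathcal{L}$, $x \vee y \in \mathcal{F}$ iff $x \in \mathcal{F}$ or $y \in \mathcal{F}$.

Proposition 40. Let $(\mathcal{L}, \mathcal{F}, \mathcal{S}, \leq)$ be a pms for
$\Sigma$, and let $\wedge$ ($\vee$) be in $\Sigma$. If the operation which
corresponds to $\wedge$ ($\vee$) is conjunctive (disjunctive), then $\wedge$
($\vee$) is both an internal and a combining conjunction (disjunction) w.r.t.
$\vdash^{\mathcal{L},\mathcal{F}}$.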

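Definition 41. Let $\mathcal{P}$ be a pms and $\Gamma$ a set of formulae in
$\Sigma$. A valuation $M \in mod(\Gamma)$ is a $\mathcal{P}$-preferential model
of $\Gamma$ if there is no other valuation $M' \in mod(\Gamma)$ s.t. for every
atom p, $M'(p) \leq M(p)$. The set of all the $\mathcal{P}$-preferential models
of $\Gamma$ is denoted by $!(\Gamma, \mathcal{P})$.

Definition 42. Let $\mathcal{P}$ be a pms. A set of formulae $\Gamma$
$\mathcal{P}$-preferentially entails a set of formulae $\Delta$ (notation:
$\Gamma \;|\!\sim^{\mathcal{L},\mathcal{F}}_{\leq}\; \Delta$) if every
$M \in\, !(\Gamma, \mathcal{P})$ satisfies some $\delta \in \Delta$.

Proposition 43. Let $(\mathcal{L}, \mathcal{F}, \mathcal{S}, \leq)$ be a pms.
Then $|\!\sim^{\mathcal{L},\mathcal{F}}_{\leq}$ is
$\vdash^{\mathcal{L},\mathcal{F}}$-plausible.

Corollary 44. Let $\mathcal{P} = (\mathcal{L}, \mathcal{F}, \mathcal{S}, \leq)$
be a pms for a language $\Sigma$.

a) If $\wedge$ is a conjunctive connective of $\Sigma$ (relative to
$\mathcal{P}$), then it is a combining conjunction and an internal conjunction
w.r.t. $|\!\sim^{\mathcal{L},\mathcal{F}}_{\leq}$.

b) If $\vee$ is a disjunctive connective of $\Sigma$ (relative to
$\mathcal{P}$), then it is an internal disjunction w.r.t.
$|\!\sim^{\mathcal{L},\mathcal{F}}_{\leq}$ which also satisfies $[\vee |\!\sim]_I$.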
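
The definitions of models, preferential models, and preferential entailment above can be tried out on a very small structure. The following Python sketch is only an illustration under stated assumptions: it fixes a two-valued structure with designated value t and a preference order in which f is preferred to t (a minimal-model reading in the spirit of the closed-world assumption), and it encodes formulae as nested tuples. The names `leq`, `evaluate`, `preferential_models`, and `entails` are hypothetical and do not come from the paper.

```python
from itertools import product

# Assumed two-valued instance of a pms: truth values, designated values,
# and a preference order in which "f" is preferred to "t" (f <= t).
VALUES = ("f", "t")
DESIGNATED = {"t"}


def leq(x, y):
    """Preference order on truth values (assumption: f <= t)."""
    return x == y or (x == "f" and y == "t")


def evaluate(formula, valuation):
    """Extend a valuation on atoms to formulae encoded as nested tuples."""
    op = formula[0]
    if op == "atom":
        return valuation[formula[1]]
    if op == "not":
        return "f" if evaluate(formula[1], valuation) == "t" else "t"
    left = evaluate(formula[1], valuation) == "t"
    right = evaluate(formula[2], valuation) == "t"
    result = (left and right) if op == "and" else (left or right)
    return "t" if result else "f"


def atoms_of(formula):
    """Collect the atomic formulae occurring in a formula."""
    if formula[0] == "atom":
        return {formula[1]}
    return set().union(*(atoms_of(sub) for sub in formula[1:]))


def satisfies(valuation, formula):
    return evaluate(formula, valuation) in DESIGNATED


def models(gamma, atom_list):
    """All valuations over atom_list that satisfy every formula in gamma."""
    found = []
    for values in product(VALUES, repeat=len(atom_list)):
        valuation = dict(zip(atom_list, values))
        if all(satisfies(valuation, g) for g in gamma):
            found.append(valuation)
    return found


def preferential_models(gamma, atom_list):
    """Models M with no other model M' such that M'(p) <= M(p) for every atom p."""
    mods = models(gamma, atom_list)

    def dominated(m):
        return any(other != m and all(leq(other[p], m[p]) for p in atom_list)
                   for other in mods)

    return [m for m in mods if not dominated(m)]


def entails(gamma, delta, preferential=True):
    """Every (preferential) model of gamma satisfies some formula in delta."""
    atom_list = sorted(set().union(*(atoms_of(f) for f in gamma + delta)))
    pool = (preferential_models(gamma, atom_list) if preferential
            else models(gamma, atom_list))
    return all(any(satisfies(m, d) for d in delta) for m in pool)


p, q = ("atom", "p"), ("atom", "q")
gamma = [("or", p, q)]
not_both = ("not", ("and", p, q))
print(entails(gamma, [not_both]))                      # preferential entailment
print(entails(gamma, [not_both], preferential=False))  # all models
print(entails(gamma + [("and", p, q)], [not_both]))    # after adding a premise
```

Under these assumptions the conclusion holds in the preferred models of the first premise set but is lost once the extra premise is added, which is the nonmonotonic behaviour the section describes. Replacing `VALUES`, `DESIGNATED`, `leq`, and the connective tables would give other structures.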

Examples. Many known formalisms can be viewed as based on preferential
multiple-valued structures. Among them are classical logic, Reiter’s
closed-world assumption [17], the paraconsistent logic LPm of Priest [15,16],
and the paraconsistent bilattice-based logics of [1,2].

® We use here the same symbol for a connective and its corresponding operation in $\mathcal{S}$.

Abstract. This paper reports on a series of experiments performed with
the aim of comparing the performance of Regular-DP and Regular-GSAT. 
Regular-DP is a Davis-Putnam-style procedure for solving the propo- 
sitional satisfiability problem in regular CNF formulas (regular SAT). 
Regular-GSAT is a GSAT-style procedure for finding satisfying inter- 
pretations in regular CNF formulas. Our experimental results provide 
experimental evidence that Regular-GSAT outperforms Regular-DP on 
computationally difficult regular random 3-SAT instances, and suggest 
that local search methods can extend the range and size of satisfiability 
problems that can be efficiently solved in many-valued logics. 



1 Introduction 

In this paper we investigate the use of both systematic and local search al- 
gorithms for solving the propositional satisfiability problem in regular CNF 
formulas (regular SAT). Concerning systematic search algorithms we focus on 
Regular-DP. It is a Davis-Putnam-style procedure for regular CNF formulas
defined by Hähnle [4], which we have implemented in C++ and equipped with
suitable data structures for representing formulas [7]. Here, we describe
Regular-DP, define a new branching heuristic and report on experimental results
that indicate that Regular-DP with our heuristic outperforms Regular-DP with
Hähnle's heuristic [4]. Concerning local search algorithms we describe a new
procedure,
Regular-GSAT. It is an extension of GSAT [10] to the framework of regular 
CNF formulas that we have designed and implemented in C++. As far as we 
know, Regular-GSAT is the first local search algorithm developed so far for 
many-valued logics. 

In order to compare the performance of Regular-DP and Regular-GSAT we
performed a series of experiments on a class of computationally difficult
instances that we identified in previous work [7]. They are satisfiable
instances from the hard region of the phase transition that we observed in the
regular random 3-SAT problem (cf. Section 5). Our results provide experimental
evidence that Regular-GSAT outperforms Regular-DP on this kind of instance, and
suggest that local search methods can extend the range and size of
satisfiability problems that can be efficiently solved in many-valued logics.

Regular CNF formulas are a relevant subclass of signed CNF formulas (cf.
Section 2). Both kinds of formulas are a suitable formalism for representing
and solving problems in many-valued logics. In fact, they can be seen as the
clause forms of finitely-valued logics. Given any finitely-valued formula, one
can derive a satisfiability-equivalent signed CNF formula in polynomial time
[3]. Interestingly, Hähnle [2] identified a broad class of finitely-valued
logics, so-called regular logics, and showed in [3] that any formula of such a
logic can be transformed into a satisfiability-equivalent regular CNF formula
whose length is linear in both the length of the transformed formula and the
cardinality of the truth value set. Recently, he has found a method for
translating a signed CNF formula into a satisfiability-equivalent regular CNF
formula where the length of the latter is polynomial in the length of the
former [5]. Thus, an instance of the satisfiability problem in any
finitely-valued logic is polynomially reducible to an instance of regular SAT.
In a sense, regular CNF formulas play in finitely-valued logics the same role
as (classical) CNF formulas play in classical logic.

This paper is organized as follows. In Section 2 we define the logic of signed 
and regular CNF formulas. In Section 3 we describe Regular-DP and a new 
branching heuristic. In Section 4 we describe Regular-GSAT. In Section 5 we 
explain the generation of regular random 3-SAT instances and the phase transi- 
tion phenomenon. In Section 6 we report on our experimental investigation. We 
finish the paper with some concluding remarks. 

2 Signed CNF formulas 

In this section we define the logic of signed CNF formulas. Regular CNF formulas 
are presented as a subclass of signed CNF formulas. 

Definition 1. A truth value set N is a finite set $\{i_1, i_2, \ldots, i_n\}$,
where $n \in \mathbb{N}$. The cardinality of N is denoted by $|N|$.

Definition 2. Let S be a subset of N ($S \subseteq N$) and let p be a
propositional variable. An expression of the form $S : p$ is a signed literal
and S is its sign. The complement of the signed literal $L = S : p$, denoted by
$\bar{L} = \bar{S} : p$, is $(N \setminus S) : p$. A signed literal $S : p$
subsumes a signed literal $S' : p'$, denoted by $S : p \subseteq S' : p'$, iff
$p = p'$ and $S \subseteq S'$. A signed clause is a finite set of signed
literals. A signed clause containing exactly one literal is a signed unit
clause. A signed CNF formula is a finite set of signed clauses.

Definition 3. The length of a sign S, denoted by $|S|$, is the number of truth
values that occur in S. The length of a signed clause C, denoted by $|C|$, is
the total number of occurrences of signed literals in C. The length of a signed
CNF formula $\Gamma$, denoted by $|\Gamma|$, is the sum of the lengths of its
signed clauses.

Definition 4. An interpretation is a mapping that assigns to every
propositional variable an element of the truth value set. An interpretation I
satisfies a signed literal $S : p$ iff $I(p) \in S$. An interpretation satisfies
a signed clause iff it satisfies at least one of its signed literals. A signed
CNF formula $\Gamma$ is satisfiable iff there exists at least one
interpretation that satisfies all the signed clauses in $\Gamma$. A signed CNF
formula that is not satisfiable is unsatisfiable. The signed empty clause is
always unsatisfiable and the signed empty CNF formula is always satisfiable.

Definition 5. Let $\uparrow i$ denote the set $\{j \in N \mid j \geq i\}$ and let
$\downarrow i$ denote the set $\{j \in N \mid j \leq i\}$, where $\leq$ is a total
order on the truth value set N and $i \in N$. If a sign S is equal to either
$\uparrow i$ or $\downarrow i$, for some i, then it is a regular sign.

Definition 6. Let S be a regular sign and let p be a propositional variable. An
expression of the form $S : p$ is a regular literal. If S is of the form
$\uparrow i$ ($\downarrow i$) then we say that $S : p$ has positive (negative)
polarity. A regular clause (CNF formula) is a signed clause (CNF formula) whose
literals are regular.
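
The notions defined in this section translate directly into a small data model. The Python sketch below is illustrative only and makes the following assumptions: the truth value set is N = {0, ..., n-1} with the usual order on integers, a signed literal is a pair `(sign, variable)` with the sign stored as a `frozenset`, and all function names are hypothetical.

```python
def up(i, n):
    """Regular sign 'up i' = {j in N | j >= i}, with N = {0, ..., n-1}."""
    return frozenset(range(i, n))


def down(i, n):
    """Regular sign 'down i' = {j in N | j <= i}."""
    return frozenset(range(0, i + 1))


def is_regular(sign, n):
    """A sign is regular iff it equals 'up i' or 'down i' for some i in N."""
    return any(sign in (up(i, n), down(i, n)) for i in range(n))


def complement(literal, n):
    """Complement of S:p is (N \\ S):p."""
    sign, var = literal
    return (frozenset(range(n)) - sign, var)


def subsumes(lit_a, lit_b):
    """S:p subsumes S':p' iff p = p' and S is a subset of S'."""
    (sign_a, var_a), (sign_b, var_b) = lit_a, lit_b
    return var_a == var_b and sign_a <= sign_b


def satisfies_formula(interp, formula):
    """Every clause must contain a literal S:p with interp[p] in S."""
    return all(any(interp[var] in sign for sign, var in clause)
               for clause in formula)


n = 3
formula = [
    frozenset({(up(1, n), "p")}),
    frozenset({(down(1, n), "p"), (up(2, n), "q")}),
]
print(satisfies_formula({"p": 1, "q": 0}, formula))
print(complement((up(1, n), "p"), n) == (down(0, n), "p"))
```

Storing signs as sets keeps subsumption and complement one-liners; an implementation tuned for speed would more likely store a polarity and a threshold, since regular signs are determined by those two values.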

3 The Regular-DP procedure 

In this section we first describe Regular-DP [4] and then a new branching
heuristic. Regular-DP is based on the following rules:

Regular one-literal rule: Given a regular CNF formula $\Gamma$ containing a
regular unit clause $\{S : p\}$,

1. remove all clauses containing a literal $S' : p$ such that $S \subseteq S'$;

2. delete all occurrences of literals $S'' : p$ such that $S \cap S'' = \emptyset$.

Regular branching rule: Reduce the problem of determining whether a regular
CNF formula $\Gamma$ is satisfiable to the problem of determining whether
$\Gamma \cup \{S : p\}$ is satisfiable or $\Gamma \cup \{\bar{S} : p\}$ is
satisfiable, where $S : p$ is a regular literal occurring in $\Gamma$.

The pseudo-code of Regular-DP is shown in Figure 1. It returns true (false)
if the input regular CNF formula $\Gamma$ is satisfiable (unsatisfiable).
First, it repeatedly applies the regular one-literal rule and derives a
simplified formula $\Gamma'$. Once the formula cannot be further simplified, it
selects a regular literal $S : p$ of $\Gamma'$, applies the branching rule and
solves recursively the problem of deciding whether $\Gamma' \cup \{S : p\}$ is
satisfiable or $\Gamma' \cup \{\bar{S} : p\}$ is satisfiable. As such
subproblems contain a regular unit clause, the regular one-literal rule can be
applied again. Regular-DP terminates when some subproblem is shown to be
satisfiable by deriving the regular empty CNF formula, or when all the
subproblems are shown to be unsatisfiable by deriving the regular empty clause
in all of them. In the pseudo-code, $\Gamma_{S:p}$ denotes the formula obtained
after applying the regular one-literal rule to a regular CNF formula $\Gamma$
using the regular unit clause $\{S : p\}$.

Regular-DP can be viewed as the construction of a proof tree using a
depth-first strategy. The root node contains the input formula and the
remaining nodes represent applications of the regular one-literal rule.
Otherwise stated, there is one node for each recursive call of Regular-DP. When
all the leaves contain the regular empty clause, the input formula is
unsatisfiable. When some leaf contains the regular empty formula, the input
formula is satisfiable.

procedure Regular-DP
Input: a regular CNF formula $\Gamma$ and a truth value set N
Output: true if $\Gamma$ is satisfiable and false if $\Gamma$ is unsatisfiable
begin
    if $\Gamma = \emptyset$ then return true;
    if $\Box \in \Gamma$ then return false;
    /* regular one-literal rule */
    if $\Gamma$ contains a unit clause $\{S' : p\}$ then return Regular-DP($\Gamma_{S':p}$);
    let $S : p$ be a regular literal occurring in $\Gamma$;
    /* regular branching rule */
    if Regular-DP($\Gamma_{S:p}$) then return true;
    else return Regular-DP($\Gamma_{\bar{S}:p}$);
end

Figure 1. The Regular-DP procedure

Observe that the (classical) Davis-Putnam procedure is a particular case of
Regular-DP. In order to obtain the Davis-Putnam procedure, we should take
N = {0, 1} and represent a classical positive (negative) literal $p$
($\neg p$) by the regular literal $\uparrow 1 : p$ ($\downarrow 0 : p$).
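
The two rules and the recursive procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C++ implementation: it assumes N = {0, ..., n-1}, represents a literal as a pair `(sign, variable)` with the sign as a `frozenset`, and branches on an arbitrary literal of the first clause rather than using a heuristic. All names are hypothetical. The example formula is the one from Example 1.

```python
def up(i, n):
    return frozenset(range(i, n))


def down(i, n):
    return frozenset(range(0, i + 1))


def one_literal(formula, unit):
    """Apply the regular one-literal rule with the unit clause {unit}."""
    sign, var = unit
    simplified = []
    for clause in formula:
        # Rule 1: drop clauses containing a literal S':p with S subset of S'.
        if any(v == var and sign <= s for s, v in clause):
            continue
        # Rule 2: delete literals S'':p whose sign is disjoint from S.
        simplified.append(frozenset(
            (s, v) for s, v in clause if not (v == var and not (sign & s))))
    return simplified


def regular_dp(formula, n):
    """Return True iff the regular CNF formula (a list of clauses) is satisfiable."""
    if not formula:                      # empty formula
        return True
    if frozenset() in formula:           # empty clause
        return False
    for clause in formula:               # regular one-literal rule
        if len(clause) == 1:
            (unit,) = clause
            return regular_dp(one_literal(formula, unit), n)
    sign, var = next(iter(formula[0]))   # naive choice of branching literal
    if regular_dp(one_literal(formula, (sign, var)), n):
        return True
    complement = frozenset(range(n)) - sign
    if not complement:                   # a literal with sign N has no complement branch
        return False
    return regular_dp(one_literal(formula, (complement, var)), n)


n = 3
example_1 = [
    frozenset({(down(0, n), "p1"), (down(1, n), "p2")}),
    frozenset({(up(1, n), "p1"), (down(0, n), "p2")}),
    frozenset({(down(0, n), "p1"), (up(2, n), "p3")}),
    frozenset({(up(2, n), "p2"), (up(1, n), "p3")}),
    frozenset({(up(2, n), "p2"), (down(0, n), "p3")}),
]
print(regular_dp(example_1, n))
```

Calling `one_literal` with the branching literal directly plays the role of adding the unit clause and then simplifying, which is how the pseudo-code uses the subscripted formula.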

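Example 1. Let N = {0, 1, 2} and let $\Gamma$ be the following regular CNF formula:

$\{\{\downarrow 0 : p_1, \downarrow 1 : p_2\}, \{\uparrow 1 : p_1, \downarrow 0 : p_2\}, \{\downarrow 0 : p_1, \uparrow 2 : p_3\}, \{\uparrow 2 : p_2, \uparrow 1 : p_3\}, \{\uparrow 2 : p_2, \downarrow 0 : p_3\}\}$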



Figure 2 shows the proof tree created by Regular-DP when the input is F. Edges 
labelled with a regular CNF formula indicate the application of the regular 
branching rule and the regular one-literal rule using the regular unit clause in 
the label. Edges labelled with a regular unit clause indicate the application of 
the regular one-literal rule using that clause. 

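Root: $\Gamma$
    Branch $\Gamma \cup \{\downarrow 0 : p_1\}$: $\{\{\downarrow 0 : p_2\}, \{\uparrow 2 : p_2, \uparrow 1 : p_3\}, \{\uparrow 2 : p_2, \downarrow 0 : p_3\}\}$
        Unit clause $\{\downarrow 0 : p_2\}$: $\{\{\uparrow 1 : p_3\}, \{\downarrow 0 : p_3\}\}$
            Unit clause $\{\uparrow 1 : p_3\}$: $\Box$
    Branch $\Gamma \cup \{\uparrow 1 : p_1\}$: $\{\{\downarrow 1 : p_2\}, \{\uparrow 2 : p_3\}, \{\uparrow 2 : p_2, \uparrow 1 : p_3\}, \{\uparrow 2 : p_2, \downarrow 0 : p_3\}\}$
        Unit clause $\{\downarrow 1 : p_2\}$: $\{\{\uparrow 2 : p_3\}, \{\uparrow 1 : p_3\}, \{\downarrow 0 : p_3\}\}$
            Unit clause $\{\uparrow 2 : p_3\}$: $\Box$

Figure 2. A proof tree created by Regular-DP

The performance of Regular-DP depends dramatically on the heuristic that
selects the next literal to which the branching rule is applied. Hähnle defined
a branching heuristic which is an extension of the two-sided Jeroslow-Wang
rule [4]: given a regular CNF formula $\Gamma$, such a heuristic selects a
regular literal L occurring in $\Gamma$ that maximizes $J(L) + J(\bar{L})$, where

$$J(L) = \sum_{\exists L' :\; L' \subseteq L,\; L' \in C \in \Gamma} 2^{-|C|}. \qquad (1)$$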

Taking into account the work of [6] on the Jeroslow-Wang rule in the classical
setting, we propose the following definition of J(L):

$$J(L) = \sum_{\exists L' :\; L' \subseteq L,\; L' \in C \in \Gamma} \left( \prod_{S : p \,\in\, C} \frac{|N| - |S|}{2(|N| - 1)} \right) \qquad (2)$$

Hähnle's definition of J(L) assigns a bigger value to those regular literals L
subsumed by regular literals L' that appear in many small clauses. This way,
when Regular-DP branches on L, the probability of deriving new regular unit
clauses is bigger. Our definition of J(L) takes into account the length of
regular signs as well. This fact is important because regular literals with
small signs have a bigger probability of being eliminated during the
application of the regular one-literal rule. Observe that in the case that
$|N| = 2$ every factor in (2) equals 1/2, so we get the same equation as (1).

In the following we will refer to the branching heuristic that uses (1) to
calculate $J(L) + J(\bar{L})$ as RH-1, and the branching heuristic that uses (2)
as RH-2. In Section 6 we provide experimental evidence that RH-2 outperforms
RH-1 on a class of randomly generated instances.
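
To make the two scoring functions concrete, the following Python sketch evaluates (1) and (2) for a given literal. It is an illustration under assumptions: N = {0, ..., n-1} with n at least 2, literals are `(sign, variable)` pairs with `frozenset` signs, each clause that contains some literal subsuming L contributes one term to the sum, and the function names are hypothetical.

```python
def contains_subsuming_literal(clause, literal):
    """True iff the clause contains some L' with L' subset of L (same variable)."""
    sign, var = literal
    return any(v == var and s <= sign for s, v in clause)


def j_rh1(literal, formula, n):
    """Equation (1): sum of 2^(-|C|) over the qualifying clauses C (n is unused)."""
    return sum(2.0 ** -len(clause) for clause in formula
               if contains_subsuming_literal(clause, literal))


def j_rh2(literal, formula, n):
    """Equation (2): each clause is weighted by the product, over its
    literals S:p, of (|N| - |S|) / (2 (|N| - 1))."""
    total = 0.0
    for clause in formula:
        if contains_subsuming_literal(clause, literal):
            weight = 1.0
            for sign, _ in clause:
                weight *= (n - len(sign)) / (2.0 * (n - 1))
            total += weight
    return total


def branching_score(literal, formula, n, j):
    """J(L) + J(complement of L) for the chosen definition j of J."""
    sign, var = literal
    complement = (frozenset(range(n)) - sign, var)
    return j(literal, formula, n) + j(complement, formula, n)


# One clause {up 1 : p, down 0 : q} over N = {0, 1, 2}.
formula = [frozenset({(frozenset({1, 2}), "p"), (frozenset({0}), "q")})]
literal = (frozenset({1, 2}), "p")
print(j_rh1(literal, formula, 3), j_rh2(literal, formula, 3))
```

A branching routine would evaluate `branching_score` for every literal occurring in the formula and pick a maximizer, using `j_rh1` for RH-1 and `j_rh2` for RH-2.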
4 The Regular-GSAT procedure

Regular-GSAT is an extension of GSAT [10] to the framework of regular CNF
formulas; its pseudo-code is shown in Figure 3. Regular-GSAT tries to find a
satisfying interpretation for a regular CNF formula $\Gamma$ by performing a
greedy local search through the space of possible interpretations. It starts
with a randomly generated interpretation I. If I does not satisfy $\Gamma$, it
creates a set, say S, formed by the variable-value pairs (p, k) that, when the
truth value that I assigns to p is changed to k, give the largest decrease (it
may be zero or negative) in the total number of unsatisfied clauses of
$\Gamma$. Then, it randomly chooses a propositional variable p' that appears in
S. Once p' is selected, it randomly chooses a truth value k' from those that
appear in variable-value pairs of S that contain p'. Next, it changes the
assignment of the propositional variable p' to the truth value k'. Such changes
are repeated until either a satisfying interpretation is found or a pre-set
maximum number of changes (MaxChanges) is reached. This process is repeated as
needed, up to a maximum of MaxTries times.



procedure Regular-GSAT
Input: a regular CNF formula $\Gamma$, MaxChanges and MaxTries
Output: a satisfying interpretation of $\Gamma$, if found
begin
    for i := 1 to MaxTries
        I := a randomly generated interpretation for $\Gamma$;
        for j := 1 to MaxChanges
            if I satisfies $\Gamma$ then return I;
            Let S be the set of variable-value pairs of the form (p, k) that, when
                the truth value that I assigns to p is changed to k, give the largest
                decrease in the total number of clauses of $\Gamma$ that are unsatisfied;
            Pick, at random, one variable p' from the set {p | (p, k) ∈ S};
            Pick, at random, one value k' from the set {k | (p', k) ∈ S};
            I := I with the truth assignment of p' changed to k';
        end for
    end for
    return “no satisfying interpretation found”;
end

Figure 3. The Regular-GSAT procedure
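
The procedure of Figure 3 can be sketched in Python as follows. This is only an illustrative reading of the pseudo-code, not the authors' implementation: it assumes N = {0, ..., n-1} with n at least 2, literals as `(sign, variable)` pairs with `frozenset` signs, and it recomputes the number of unsatisfied clauses from scratch for every candidate change, which a practical implementation would avoid. The parameter `rng` and all function names are hypothetical.

```python
import random


def count_unsatisfied(formula, interp):
    """Number of clauses with no literal S:p such that interp[p] is in S."""
    return sum(1 for clause in formula
               if not any(interp[var] in sign for sign, var in clause))


def regular_gsat(formula, variables, n, max_changes, max_tries, rng=None):
    """Greedy local search; returns a satisfying interpretation or None."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        interp = {var: rng.randrange(n) for var in variables}
        for _ in range(max_changes):
            if count_unsatisfied(formula, interp) == 0:
                return interp
            # Score every change of one variable to a different value.
            scores = {}
            for var in variables:
                for value in range(n):
                    if value != interp[var]:
                        trial = dict(interp)
                        trial[var] = value
                        scores[(var, value)] = count_unsatisfied(formula, trial)
            best = min(scores.values())
            candidates = [pair for pair, score in scores.items() if score == best]
            # First pick a variable among the best pairs, then one of its values.
            chosen_var = rng.choice(sorted({var for var, _ in candidates}))
            chosen_value = rng.choice(
                [value for var, value in candidates if var == chosen_var])
            interp[chosen_var] = chosen_value
    return None


n = 3
formula = [frozenset({(frozenset({1, 2}), "p")}),
           frozenset({(frozenset({0}), "q"), (frozenset({2}), "p")})]
print(regular_gsat(formula, ["p", "q"], n, max_changes=10, max_tries=5,
                   rng=random.Random(0)))
```

As in the pseudo-code, a `None` result only means that no satisfying interpretation was found within the given limits; it does not show unsatisfiability.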



One difficulty with local search algorithms for satisfiability testing, and in
particular with Regular-GSAT, is that they are incomplete and cannot prove
unsatisfiability. Another difficulty is that they can get stuck in local
minima. To cope with this problem, the most basic strategy is to restart the
algorithm after a fixed number of changes, as in Regular-GSAT. More strategies
to escape from local minima are described in [8, 9]. We plan to incorporate
such strategies into Regular-GSAT in the near future. As far as we know,
Regular-GSAT is the first local search algorithm that has been defined for
finding models in many-valued logics.

In the classical setting, GSAT has successfully solved problems that cannot 
be solved with the fastest systematic algorithms. In Section 6 we provide ex- 
perimental evidence that Regular-GSAT outperforms Regular-DP on a class of 
computationally difficult instances. 



5 Regular random 3-SAT problem 

In order to compare the performance of the algorithms and heuristics described
above, we used instances of the regular random 3-SAT problem as benchmarks.
This choice was motivated by the fact that solving some of those instances is
known to be computationally difficult [1,7]. Moreover, a large number of
instances can be generated easily. In this section, we first describe how to
generate regular random 3-SAT instances and then how to identify
computationally difficult instances.

Given a number of clauses (C), a number of propositional variables (V) and a
truth value set (N), a regular random 3-SAT instance is produced by generating
C non-tautological regular clauses. Each regular clause is produced by
uniformly choosing three literals with different propositional variables from
the set of regular literals of the form $\downarrow i : p$ or $\uparrow j : p$,
where p is a propositional variable, $i \in N \setminus \{\top\}$,
$j \in N \setminus \{\bot\}$, and $\top$ and $\bot$ denote the top and bottom
elements of N. Observe that regular literals of the form $\downarrow \top : p$
or $\uparrow \bot : p$ are tautological.
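
The generation scheme just described can be sketched as follows. The code is a hedged illustration, not the generator used in the experiments: it assumes N = {0, ..., n-1} with n at least 2, numbers the variables 1, ..., V, and draws three distinct variables followed by one non-tautological sign for each, which yields three literals with different variables chosen uniformly. The function name and the `rng` parameter are hypothetical.

```python
import random


def random_regular_3sat(num_clauses, num_vars, n, rng=None):
    """Generate num_clauses regular clauses, each with three literals over
    distinct variables; signs range over 'down i' (i not the top element)
    and 'up j' (j not the bottom element)."""
    rng = rng or random.Random()
    signs = ([frozenset(range(0, i + 1)) for i in range(n - 1)] +   # down i
             [frozenset(range(j, n)) for j in range(1, n)])         # up j
    formula = []
    for _ in range(num_clauses):
        chosen_vars = rng.sample(range(1, num_vars + 1), 3)
        formula.append(frozenset((rng.choice(signs), var) for var in chosen_vars))
    return formula


instance = random_regular_3sat(num_clauses=10, num_vars=5, n=3, rng=random.Random(0))
print(len(instance), all(len(clause) == 3 for clause in instance))
```

Because the three variables of a clause are distinct and no sign equals N, no generated clause is tautological.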

In [7] we reported experimental results on testing the satisfiability of
regular random 3-SAT instances with Regular-DP using heuristic RH-1. We
observed that (i) there is a sharp phase transition from satisfiable to
unsatisfiable instances for a value of the ratio of the number of clauses to
the number of variables (C/V). At lower ratios, most of the instances are
under-constrained and are thus satisfiable. At higher ratios, most of the
instances are over-constrained and are thus unsatisfiable. The value of C/V
where 50% of the instances are satisfiable is referred to as the crossover
point; (ii) there is an easy-hard-easy pattern in the computational difficulty
of solving problem instances as C/V is varied; the hard instances tend to be
found near the crossover point; and (iii) the location of the crossover point
increases as the cardinality of the truth value set increases. Recently, we
have shown experimentally that such an increase is logarithmic [1]. Given the
great difficulty of solving hard regular random 3-SAT instances of moderate
size with systematic search algorithms, we believe that they are a suitable
testbed for evaluating and comparing regular satisfiability testing algorithms.

Figure 4 shows graphically the phase transition phenomenon for the regular
random 3-SAT problem when $|N| = 7$ and V = 60. Along the vertical axis is the
average number of nodes in the proof tree created by Regular-DP. Along the
horizontal axis is the ratio C/V in the instances tested. One can clearly
observe the easy-hard-easy pattern as C/V is varied. The dashed line indicates
the percentage of instances that were found to be satisfiable, scaled in such a
way that 100% corresponds to the maximum average number of nodes.

Figure 4. Phase transition in the regular random 3-SAT problem



6 Experimental Results 

In this section we report on a series of experiments performed in order to
compare (i) Regular-DP with heuristic RH-1 against Regular-DP with heuristic
RH-2, and (ii) Regular-DP with heuristic RH-2 against Regular-GSAT. The
experiments were performed on a PC with a 400 MHz Alpha processor running
Linux. As already mentioned, the algorithms were implemented in C++. All the
instances we refer to below are regular random 3-SAT instances from the hard
region of the phase transition. To be more precise, the ratio of the number of
clauses to the number of variables in the instances tested corresponds to the
crossover point.



Experiment 1 

Table 1 shows the results of an experiment performed in order to compare the 
performance of Regular-DP with heuristic RH-1 and Regular-DP with heuristic 
RH-2. We considered regular random 3-SAT instances with |N| = 3. Each 
procedure was run on 100 satisfiable instances for each one of the values of C and 
V shown in the table. The average time needed to solve an instance, as well as 
the average number of applications of the branching rule, are shown in the table. 
The results obtained provide experimental evidence that, at least for the kind of 
instances tested, Regular-DP with heuristic RH-2 outperforms Regular-DP with 
heuristic RH-1. 

                Regular-DP with RH-1             Regular-DP with RH-2
  V      C    branching rule  time (secs.)     branching rule  time (secs.)
 80    487          424             1                281            0.3
120    720         4872            20               2655            4
160    972        55144           290              23566           45
200   1230       482028          3152             210560          523

Table 1. Comparison between Regular-DP with heuristic RH-1 and Regular-DP with 
heuristic RH-2. 
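
To make the comparison in Table 1 easier to read, the following sketch computes, for each row, how many times more branchings and how much more time RH-1 needs than RH-2. The numbers are transcribed from Table 1; the names `TABLE_1` and `rh2_gains` are illustrative only.

```python
# Rows transcribed from Table 1:
# (V, C, RH-1 branchings, RH-1 secs, RH-2 branchings, RH-2 secs)
TABLE_1 = [
    (80, 487, 424, 1, 281, 0.3),
    (120, 720, 4872, 20, 2655, 4),
    (160, 972, 55144, 290, 23566, 45),
    (200, 1230, 482028, 3152, 210560, 523),
]


def rh2_gains(rows):
    """Return (V, branching ratio, time ratio) of RH-1 over RH-2 per row."""
    return [(v, b1 / b2, t1 / t2) for v, _c, b1, t1, b2, t2 in rows]


for v, branch_ratio, time_ratio in rh2_gains(TABLE_1):
    print(f"V={v}: branchings x{branch_ratio:.2f}, time x{time_ratio:.2f}")
```

Both ratios stay above 1 in every row, which is the sense in which RH-2 outperforms RH-1 on these instances.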



Experiment 2 

Table 2 shows the results of an experiment performed in order to compare the 
performance of Regular-DP with heuristic RH-2 and Regular-GSAT on satisfiable 
regular random 3-SAT instances with |N| = 3 and |N| = 4. Both procedures 
were applied to blocks of 100 satisfiable instances with 80, 120, 160 and 200 
propositional variables. In order to obtain more accurate results, each instance 
was run 50 times with Regular-GSAT. The first column contains the cardinality 
of the truth value set (|N|), the second contains the number of propositional 
variables (V), and the third the number of clauses (C) in the instances tested. 
The remaining columns display the settings of MaxTries (MT) and MaxChanges 
(MC) employed by Regular-GSAT, the average number of tries needed to find 
a solution, the average number of changes performed in the last try (changes 
l.t.), the average time needed to solve an instance with Regular-GSAT and the 
average time needed to solve an instance with Regular-DP. The running time of 
each instance solved with Regular-DP corresponds to the time needed to solve 
that instance, whereas the running time of each instance solved with 
Regular-GSAT is the average running time of 50 runs on that instance. 



                                    Regular-GSAT                           Regular-DP
|N|    V      C      MT      MC    tries   changes l.t.   time (secs.)    time (secs.)
 3     80    487    100    1000      13        526             0.3             0.3
 3    120    720    200    2800      66       1742             6               4
 3    160    972    260    6200      97       4015            26              45
 3    200   1230    400   12000     136       7781           100             523
 4     80    566    150    2000      31       1024             1               0.5
 4    120    848    280    2800      95       2529            12               8
 4    160   1133    400    8000     168       5596            52             126
 4    200   1432    600   28000     270      17969           311            1013

Table 2. Comparison between Regular-GSAT and Regular-DP. 



Looking at the average times, it is clear that Regular-GSAT scales better than 
Regular-DP when the number of variables in the problem instances increases. 
Such results suggest that local search methods can extend the range and size of 
satisfiability problems that can be efficiently solved in many-valued logics. 
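
The scaling claim can be checked directly against the running times in Table 2. The sketch below computes, for each cardinality |N|, by what factor the average time grows between V = 80 and V = 200 for both procedures. The times are transcribed from Table 2; the helper name `growth_factors` is an illustrative assumption.

```python
# (|N|, V, Regular-GSAT secs, Regular-DP secs) transcribed from Table 2.
TABLE_2_TIMES = [
    (3, 80, 0.3, 0.3), (3, 120, 6, 4), (3, 160, 26, 45), (3, 200, 100, 523),
    (4, 80, 1, 0.5), (4, 120, 12, 8), (4, 160, 52, 126), (4, 200, 311, 1013),
]


def growth_factors(rows, small_v=80, large_v=200):
    """Map |N| -> (GSAT growth, DP growth) between small_v and large_v."""
    times = {(n, v): (gsat, dp) for n, v, gsat, dp in rows}
    factors = {}
    for n in sorted({n for n, _v, _g, _d in rows}):
        gsat_small, dp_small = times[(n, small_v)]
        gsat_large, dp_large = times[(n, large_v)]
        factors[n] = (gsat_large / gsat_small, dp_large / dp_small)
    return factors


for n, (gsat_growth, dp_growth) in growth_factors(TABLE_2_TIMES).items():
    print(f"|N|={n}: GSAT x{gsat_growth:.0f}, DP x{dp_growth:.0f}")
```

For both cardinalities the growth factor of Regular-DP is several times larger than that of Regular-GSAT, which is what "scales better" refers to here.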

7 Concluding remarks 

In this paper we have presented several original results: (i) a new branching 
heuristic for Regular-DP, (ii) an extension of GSAT [10] to the framework of 
regular CNF formulas, and (iii) an experimental comparison of the algorithms 
and heuristics described in the preceding sections. For regular random 3-SAT 
instances of the hard region of the phase transition, our results provide experi- 
mental evidence that (i) Regular-DP with our heuristic outperforms Regular-DP 
with Hähnle's heuristic [4], and (ii) Regular-GSAT outperforms Regular-DP. 
Such results suggest that local search methods can extend the range and size 
of satisfiability problems that can be efficiently solved in many-valued logics. 
In the near future, we plan to compare the performance of Regular-DP and 
Regular-GSAT for solving problems other than regular random 3-SAT and equip 
Regular-GSAT with more powerful strategies to escape from local minima. 

Abstract. The purpose of this paper is to investigate a new methodology 
for default reasoning in the presence of priorities which are represented 
by an arbitrary partial order on default rules. In a previous work, we 
have defined an alternative approach for characterizing extensions in pri- 
oritized default theories initially proposed by Brewka. We present here 
another approach for computing default proofs for query answering in 
prioritized normal default logic. 



1 Introduction 

Representing and reasoning with incomplete knowledge has gained growing 
importance in recent decades. The literature contains many different approaches. 
Most of them are directly based on non-monotonic reasoning, and Reiter's default 
logic [7] seems to be one of the best known and most popular ones. In default 
logic, knowledge is represented by means of a set W of closed formulas and a 
set D of default rules of the form $\frac{\alpha : \beta}{\gamma}$, where $\alpha$, $\beta$, $\gamma$ are formulas respectively 
called prerequisite, justification and consequent. The defaults in D induce one or 
multiple extensions of the facts in W. Any such extension, E say, is a deductively 
closed set of formulas containing W such that for any $\frac{\alpha : \beta}{\gamma} \in D$, if $\alpha \in E$ and 
$\neg\beta \notin E$ then $\gamma \in E$. 

Despite default logic's popularity, we encounter deficiencies in the most 
general setting. For instance, default logic does not guarantee the existence of 
extensions. Moreover, it does not address the notion of specificity, which is a 
fundamental principle in common sense reasoning according to which more specific 
defaults should be preferred over less specific ones. For avoiding the first 
problem, it is usually sufficient to restrict oneself to a subclass of default logic given 
by so-called normal default theories. Such theories consist of defaults of the form 
$\frac{\alpha : \beta}{\beta}$. This class enjoys several desirable properties not present in the general case. 
Apart from guaranteeing the existence of extensions, normal default theories are 
semi-monotonic (i.e. monotonic wrt the addition of defaults) and the resulting 
extensions are always mutually contradictory. In particular, the 
former property is very desirable from a computational point of view, since it 
allows us to restrict our attention to default rules relevant for proving a query 
(cf. [7]). 

Unfortunately, the restriction to normal default theories (or any other 
subclass) does not address the failure of specificity. As an example, 
consider the statements "typically penguins do not fly", "penguins are typically 
birds" and "typically birds fly" along with the corresponding default theory^1: 

$$(D, W) = \left( \left\{ \frac{p : \neg f}{\neg f},\ \frac{p : b}{b},\ \frac{b : f}{f} \right\},\ \{p\} \right) \qquad (1)$$

Intuitively, starting from W, we would like to derive $b$ and $\neg f$ ("a penguin is 
a bird that does not fly"). We do not want to derive $f$, since this violates the 
principle of specificity. In fact, being a penguin is more specific than being a bird 
and so we want to give the first default priority over the third one. However, this 
theory gives two extensions, $Th(\{p, b, \neg f\})$ and $Th(\{p, b, f\})$, showing that 
normal default theories cannot deal with specificity either. 
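
The two extensions of the penguin theory can be reproduced with a minimal sketch. It assumes a restricted fragment in which facts, prerequisites and consequents are all literals, so that deductive closure reduces to a set of literals and consistency reduces to the absence of a complementary pair. The encoding (strings with a leading "-" for negation, pairs `(prerequisite, consequent)` for normal defaults) and the function names are illustrative assumptions, not part of the formalism described here.

```python
from itertools import permutations


def neg(lit):
    """Complement of a literal written as 'x' or '-x'."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def active(default, facts):
    """A normal default (prereq, conseq) is applicable and adds something new."""
    prereq, conseq = default
    return ((prereq is None or prereq in facts)
            and neg(conseq) not in facts
            and conseq not in facts)


def extensions(defaults, facts):
    """Enumerate extensions by trying every application order (brute force)."""
    found = set()
    for order in permutations(defaults):
        current = set(facts)
        while True:
            firing = next((d for d in order if active(d, current)), None)
            if firing is None:
                break
            current.add(firing[1])
        found.add(frozenset(current))
    return found


# Penguin theory: p:-f/-f, p:b/b, b:f/f with W = {p}.
PENGUIN = [("p", "-f"), ("p", "b"), ("b", "f")]
exts = extensions(PENGUIN, {"p"})
print(sorted(sorted(e) for e in exts))
```

Brute-force enumeration over all orders is only meant to illustrate the example; it is exponential in the number of defaults.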

A first attempt to deal with specificity was proposed in [8] by means of 
semi-normal defaults^2, but the problem of non-existence of extensions arises again. 
Recently, some approaches were developed for isolating specificity information 
(cf. [1, 3, 4]) out of a given set of defaults. In the above example these methods 
allow us to detect automatically that the default $\frac{p : \neg f}{\neg f}$ is more specific than 
the default $\frac{b : f}{f}$. This corresponds to the intuition sketched above. In [3], this 
specificity information is expressed by means of a strict partial order <. Then, 
$\frac{p : \neg f}{\neg f} < \frac{b : f}{f}$ expresses the fact that the first default is more specific than the 
second one. In this way, one obtains so-called prioritized normal default theories 
(denoted $(\Delta, W, <)$), i.e. normal default theories along with a strict partial order 
on the defaults. As an example, consider our initial theory in (1). Now using the 
formalisms developed in [3, 4], we fix a precedence between the first and the third 
default, i.e. $\frac{p : \neg f}{\neg f} < \frac{b : f}{f}$. This yields the following prioritized default theory: 



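$$(D, W, <) = \left( \left\{ \frac{p : \neg f}{\neg f},\ \frac{p : b}{b},\ \frac{b : f}{f} \right\},\ \{p\},\ \left\{ \frac{p : \neg f}{\neg f} < \frac{b : f}{f} \right\} \right) \qquad (2)$$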



For describing prioritized extensions, Brewka introduces in [3] a characterization 
of extensions respecting the specificity information given by such a partial order. 
This approach eliminates the second extension of default theory (1) and only 
gives the first one, $Th(\{p, b, \neg f\})$, which is more specific. 

This paper is organized as follows. Section 2 recalls a characterization of 
extensions from prioritized normal default theories given in [2]. Section 3 gives 
a characterization of default proofs for query answering in this context. 



2 Characterizing prioritized extension 

As defined in [3], a prioritized default theory is a triple $(\Delta, W, <)$, where $\Delta$ is a 
finite set of normal defaults, W a set of formulas, and < a strict partial order on 
$\Delta$. As mentioned in the introduction, a more specific default $\delta$ gets priority over 
a less specific default $\delta'$; this is expressed by $\delta < \delta'$. As usual in default logic, if 
$\delta = \frac{\alpha : \beta}{\beta}$ is a normal default then $Prereq(\delta) = \alpha$ and $Conseq(\delta) = \beta$. 

^1 p stands for penguin, b for bird and f for fly. 

^2 A default is semi-normal if it is of the form $\frac{\alpha : \beta \wedge \gamma}{\gamma}$. 

Now, a central concept is that of activeness: 



Definition 1. [3] A normal default rule $\delta$ is active in a set of formulas E iff 
$Prereq(\delta) \in E$, $\neg Conseq(\delta) \notin E$, and $Conseq(\delta) \notin E$. 

By means of this notion, and for any total extension $\ll$ of <, [3] defines a 
Brewka's prioritized extension (denoted in the following by B-extension) of 
$(\Delta, W, <)$ in the following way. 



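Definition 2. Let $(\Delta, W, <)$ be a prioritized default theory and let $\ll$ be a strict 
total order containing <. Define $E_0 = Th(W)$ and for $i \geq 0$ 

$$E_{i+1} = \begin{cases}
E_i & \text{if there is no } \delta \in \Delta \text{ active in } E_i, \\
Th(E_i \cup \{Conseq(\delta)\}) & \text{otherwise, where } \delta \text{ is the } \ll\text{-minimal default rule that is active in } E_i.
\end{cases}$$

Then, E is a B-extension of $(\Delta, W, <)$ iff $E = \bigcup_{i=0}^{\infty} E_i$. 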
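
Definition 2 can be illustrated on the penguin theory (2) with a minimal sketch. As an assumption made only for this illustration, all formulas are literals, so that $Th$ reduces to a set of literals; the encoding and the names `neg`, `active` and `b_extension` are hypothetical. A total order is given as a sequence listing the defaults from most to least prioritized.

```python
from itertools import permutations


def neg(lit):
    """Complement of a literal written as 'x' or '-x'."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def active(default, facts):
    """Definition 1 in the literal-only fragment."""
    prereq, conseq = default
    return (prereq in facts
            and neg(conseq) not in facts
            and conseq not in facts)


def b_extension(total_order, facts):
    """Repeatedly fire the first active default of the total order."""
    current = set(facts)
    while True:
        firing = next((d for d in total_order if active(d, current)), None)
        if firing is None:
            return current
        current.add(firing[1])


PENGUIN = [("p", "-f"), ("p", "b"), ("b", "f")]

# Every total order that keeps p:-f/-f ahead of b:f/f contains the partial order.
results = {
    frozenset(b_extension(order, {"p"}))
    for order in permutations(PENGUIN)
    if order.index(("p", "-f")) < order.index(("b", "f"))
}
print(results)
```

All admissible total orders yield the same set here, matching the single B-extension mentioned in the introduction.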

In [2], we have defined an approach for computing extensions from prioritized 
normal default theories. In order to avoid total orders $\ll$ for the prioritized 
extension characterization in Brewka's method, we exclusively use in our 
approach the strict partial order < which is augmented during the generation of 
an extension. This approach gives the same extensions as Brewka's method. 

By using the concept of activeness in Definition 1, we have defined the set of 
minimal active defaults as follows: 



Definition 3. [2] Let $(\Delta, W, <)$ be a prioritized default theory and let S be a 
set of formulas. The set of minimal active defaults is defined as follows: 

$$MAD(\Delta, S, <) = \{ \delta \in \Delta \mid \delta \text{ is active in } S \text{ and for each } \delta' \in \Delta \text{ such that } \delta' < \delta,\ \delta' \text{ is not active in } S \}$$

Now, we give our characterization of a B-extension in prioritized normal default 
logic. 

Theorem 1. [2] Let $(\Delta, W, <)$ be a prioritized normal default theory and let E 
be a set of formulas. Define $E_0 = Th(W)$, $<_0 = <$, and for $i \geq 0$ 

if $MAD(\Delta, E_i, <_i) = \emptyset$ then 
    $E_{i+1} = E_i$ 
else, for an arbitrary default $\delta \in MAD(\Delta, E_i, <_i)$, 
    $E_{i+1} = Th(E_i \cup \{Conseq(\delta)\})$ 
    $<_{i+1} = (<_i \cup \{(\delta, \delta') \mid \delta' \in MAD(\Delta, E_i, <_i) \setminus \{\delta\}\})^*$ 
    (* represents the transitive closure) 

Then, E is a prioritized extension of $(\Delta, W, <)$ iff $E = \bigcup_{i=0}^{\infty} E_i$. 

The sequence of defaults $(\delta_0, \dots, \delta_n)$ inducing the sequence $(E_1, \dots, E_{n+1})$ is 
called a set of ordered generating defaults $OGD(\Delta, E, <)$, where n is the smallest 
integer such that $\forall k > n$, $E_k = E_n$. 
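
The construction of Theorem 1 can be sketched as follows for the literal-only fragment (an assumption made purely for illustration, so that $Th$ is a set of literals). The partial order is a set of pairs `(d, e)` meaning `d < e`; the `pick` parameter stands for the "arbitrary default" chosen among the minimal active ones. All names are hypothetical.

```python
def neg(lit):
    """Complement of a literal written as 'x' or '-x'."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def active(default, facts):
    """Definition 1 in the literal-only fragment."""
    prereq, conseq = default
    return prereq in facts and neg(conseq) not in facts and conseq not in facts


def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs`."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not extra:
            return closure
        closure |= extra


def minimal_active(defaults, facts, order):
    """MAD: active defaults with no active default strictly before them."""
    act = [d for d in defaults if active(d, facts)]
    return {d for d in act if not any((e, d) in order for e in act)}


def prioritized_extension(defaults, facts, order, pick=min):
    """Fire minimal active defaults, augmenting the order after each choice."""
    order, current = set(order), set(facts)
    while True:
        mad = minimal_active(defaults, current, order)
        if not mad:
            return current
        chosen = pick(mad)
        current.add(chosen[1])
        order = transitive_closure(order | {(chosen, d) for d in mad if d != chosen})


PENGUIN = [("p", "-f"), ("p", "b"), ("b", "f")]
ORDER = {(("p", "-f"), ("b", "f"))}
print(prioritized_extension(PENGUIN, {"p"}, ORDER))
```

On the penguin theory, both possible first choices lead to the same set, in line with the statement that this approach gives the same extensions as Brewka's method.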

In general, the two features distinguishing default rules from classical 
implications are the character of an inference rule and the additional consistency 
check. While the latter is handled in default logic in the usual manner by testing 
satisfiability, the former needs a more subtle treatment: the character of an 
inference rule can be captured by the notion of groundedness defined in [12]. A set 
of default rules $\Delta$ is grounded in a set of facts W iff there exists an enumeration 
$(\delta_i)_{i \in I}$ of $\Delta$ such that 

$$W \cup Conseq(\{\delta_0, \dots, \delta_{i-1}\}) \vdash Prereq(\delta_i) \quad \text{for } i \in I \qquad (3)$$

So, in regular default logic, groundedness and consistency constitute the two 
characteristics, or better, the two qualifying conditions for the application and 
the use of default rules. In particular, these two notions allow for characterizing 
extensions of normal default theories in a very elegant way. As given in [12], 
an extension is characterized as $Th(W \cup Conseq(\Delta'))$, the deductive closure of 
the facts W and the consequents of a maximal set of default rules $\Delta'$ which is 
grounded and preserves consistency^3. 

But, what is the additional qualifying condition needed for incorporating 
priorities? Our idea, in this paper, is to isolate sequences of default rules that 
respect the given partial order. 

3 Query-answering in prioritized default theories 

3.1 Preliminaries 

Computing extensions is central to default logic, but sometimes we are only 
interested in knowing whether a given formula $\varphi$ belongs to some extension. Instead 
of computing, one by one, each extension of a default theory and checking if 
$\varphi$ belongs to one of them, query answering is concerned with the search for a 
particular default sequence named a default proof. For this problem in normal 
default logic, Reiter has introduced the notion of default proof [7]. The reader 
can refer to [10, 11] for more details on query answering in default logics. In order 
to provide a proof theory for prioritized normal default theories, we first give 
the definition of a classical default proof without preferences. 

Definition 4 (Classical default proof). Let $(\Delta, W)$ be a normal default 
theory and $\varphi$ a formula. A classical default proof for $\varphi$ from $(\Delta, W)$ is a finite 
sequence of default rules $(\delta_i)_{i \in I}$ such that 

- deduction: $W \cup \{Conseq(\delta_i) \mid i \in I\} \vdash \varphi$ 

- groundedness: $W \cup Conseq(\{\delta_0, \dots, \delta_{i-1}\}) \vdash Prereq(\delta_i)$ 

- regularity: $W \cup Conseq(\{\delta_0, \dots, \delta_{i-1}\}) \nvdash \neg Conseq(\delta_i)$ 

^3 In normal default theories, consistency means that $W \cup Conseq(\{\delta_1, \dots, \delta_n\})$ is 
consistent. 

Now, we give a definition of a default proof in presence of priorities. 

Definition 5 (Prioritized default proof). The sequence of defaults 
$(\delta_1, \dots, \delta_n)$ is a prioritized default proof of $\varphi$ in $(\Delta, W, <)$ iff 

- $W \cup Conseq(\{\delta_1, \dots, \delta_n\}) \vdash \varphi$ 

- there exists a total order $\ll$ issued from < and inducing a prioritized 
extension E of $(\Delta, W, <)$ such that its ordered generating default set is 

$$OGD(\Delta, E, \ll) = (\delta_1, \dots, \delta_n, \dots)$$

So, as in classical normal default logic, a prioritized default proof is defined as 
the "beginning" of a default sequence that generates an extension containing the 
formula $\varphi$. Hence a prioritized default proof satisfies the regularity and 
the groundedness conditions of Definition 4. Let us note also that a prioritized 
default proof may contain some defaults that are not required for the deduction 
of $\varphi$, but that are necessary to respect the given priorities. 

At this point, it is important to note that introducing priorities into the 
query answering process of default logic is a rather difficult task because the 
following two directions of computation are hard to reconcile. 

- Top-down: searching for a default proof of $\varphi$ is done by starting from $\varphi$ and 
trying to reach W by using some necessary defaults. 

- Bottom-up: managing priorities is done by starting from W and firing 
defaults according to the priorities between them. 

The reader can find in [2] a first methodology to compute such a prioritized 
default proof, but our new proposal can be more efficient 
because it is local and does not need to examine all defaults in $\Delta$ to compute 
the answer to the query. This new method can also take advantage of a 
precompilation of some special prioritized default proofs. 

Let us end these preliminaries by introducing some notational conventions 
that we use in the following section. In order to provide easier access to the indices 
of default rules taken from a set $\{\delta_i \mid i \in I\}$, we use a polymorphic function $\iota$ 
returning the set of indices of all default rules contained in a given mathematical 
structure: 

$$\begin{array}{lll}
\iota(\delta_i) & = \{i\} & \text{elements} \\
\iota(\{\delta_1, \dots, \delta_n\}) & = \iota(\delta_1) \cup \dots \cup \iota(\delta_n) & \text{sets} \\
\iota((\delta_1, \dots, \delta_n)) & = \iota(\delta_1) \cup \dots \cup \iota(\delta_n) & \text{tuples} \\
\iota((\delta_i)_{i \in I}) & = \bigcup_{i \in I} \iota(\delta_i) & \text{sequences}
\end{array}$$

By appeal to function $\iota$, we then define for any such structure S the set of involved 
default rules as 

$$\Delta(S) = \{\delta_i \mid i \in \iota(S)\}.$$

For instance, we thus have $\Delta(\delta_{42}) = \{\delta_{42}\}$ and $\Delta((\delta_1, \dots, \delta_n)) = \{\delta_1, \dots, \delta_n\}$. 
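
The index function lends itself to a direct recursive sketch. The `Default` record, the function names `iota` and `involved`, and the sample defaults are illustrative assumptions; the only point is that one function handles single rules, sets, tuples and nested structures alike.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Default:
    """A normal default rule carrying its index (hypothetical encoding)."""
    index: int
    prereq: str
    conseq: str


def iota(structure):
    """Set of indices of all default rules contained in `structure`."""
    if isinstance(structure, Default):
        return {structure.index}
    indices = set()
    for item in structure:  # sets, tuples, lists, possibly nested
        indices |= iota(item)
    return indices


def involved(structure, theory):
    """Default rules of `theory` whose index occurs in `structure`."""
    wanted = iota(structure)
    return {d for d in theory if d.index in wanted}


d1 = Default(1, "p", "-f")
d2 = Default(2, "p", "b")
d3 = Default(3, "b", "f")
print(iota(((d1, d2), {d3})))
print(involved((d1, d3), [d1, d2, d3]))
```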
3.2 Characterization of default proofs 

In this section we provide a methodology for computing a prioritized default 
proof. By using the concept of activeness in Definition 1, we define the concept 
of <-priority preserving sequences: 

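Definition 6. Let $(\Delta, W, <)$ be a prioritized normal default theory. We call a 
sequence of default rules $(\delta_i)_{i \in I}$ <-priority preserving (wrt W) if 

$$A_i \cap P_i = \emptyset \quad \text{for } i \in I$$

where for $i \geq 0$ 

$$A_i = \{\delta \in \Delta \mid \delta \text{ is active in } Th(W \cup Conseq(\{\delta_0, \dots, \delta_{i-1}\}))\}$$
$$P_i = \{\delta \in \Delta \mid \delta <_i \delta_i\}$$
$$<_0 \;=\; < \quad \text{and} \quad <_{i+1} \;=\; <_i \cup \{(\delta_i, \delta) \mid \delta \in A_i \setminus \{\delta_i\}\}$$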
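
Definition 6 can be checked step by step with a small sketch. As before, the restriction to literals (so that $Th$ is a set of literals), the encoding of defaults as pairs and the function names are assumptions made only for illustration. The order is a set of pairs `(d, e)` meaning `d < e`.

```python
def neg(lit):
    """Complement of a literal written as 'x' or '-x'."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def active(default, facts):
    """Definition 1 in the literal-only fragment."""
    prereq, conseq = default
    return prereq in facts and neg(conseq) not in facts and conseq not in facts


def priority_preserving(sequence, defaults, facts, order):
    """Check that A_i and P_i are disjoint at every step of `sequence`."""
    order = set(order)
    current = set(facts)  # facts plus consequents of the defaults applied so far
    for chosen in sequence:
        active_now = {d for d in defaults if active(d, current)}   # A_i
        prior = {d for d in defaults if (d, chosen) in order}      # P_i
        if active_now & prior:
            return False
        # Defaults that were active but not chosen come after `chosen`.
        order |= {(chosen, d) for d in active_now if d != chosen}
        current.add(chosen[1])
    return True


PENGUIN = [("p", "-f"), ("p", "b"), ("b", "f")]
ORDER = {(("p", "-f"), ("b", "f"))}
print(priority_preserving([("p", "b"), ("b", "f")], PENGUIN, {"p"}, ORDER))
print(priority_preserving([("p", "-f"), ("p", "b")], PENGUIN, {"p"}, ORDER))
```

In the first sequence the more specific default p:-f/-f is still active when b:f/f is applied, so the check fails; the second sequence passes.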

With this concept we obtain the following results. 

Proposition 1. Let $(\Delta, W, <)$ be a prioritized normal default theory. If $(\delta_i)_{i \in I}$ 
is a sequence of default rules generating a prioritized extension of $(\Delta, W, <)$, 
then $(\delta_i)_{i \in I}$ is <-priority preserving (all wrt W). 

Proposition 2. Let $(\Delta, W, <)$ be a prioritized normal default theory. If $(\delta_i)_{i \in I}$ 
is grounded, consistent and <-priority preserving (wrt W), then $(\delta_i)_{i \in I}$ is a prefix 
of some sequence of default rules generating a prioritized extension of $(\Delta, W, <)$. 

And then we obtain the following characterization of prioritized default proofs 
for normal default theories in the presence of priorities. 

Theorem 2. Let $(\Delta, W, <)$ be a prioritized normal default theory and $\varphi$ a 
formula. The sequence of defaults $(\delta_i)_{i \in I}$ is a prioritized default proof of $\varphi$ in 
$(\Delta, W, <)$ iff 

1. $W \cup \{Conseq(\delta_i) \mid i \in I\} \vdash \varphi$ 

2. $W \cup Conseq(\{\delta_0, \dots, \delta_{i-1}\}) \vdash Prereq(\delta_i)$ 

3. $W \cup Conseq(\{\delta_0, \dots, \delta_{i-1}\}) \nvdash \neg Conseq(\delta_i)$ 

4. $(\delta_i)_{i \in I}$ is <-priority preserving (wrt W) 

[11] describes how the query answering system XRay builds a default proof 
of a given formula in default logic without priorities. But, for prioritized query- 
answering, we cannot concentrate anymore on the default rules that allow for 
deriving a query. Whenever a putative default proof comprises default rules that 
are subject to <-preferred default rules, we must justify the usage of the former 
rules. This can be accomplished in two ways: Either by showing that we may 
coherently apply both default rules (those that are required by deduction and 
those that are required by priorities) , or by providing another default proof which 
blocks the <-preferred default rule and which is itself coherently combinable with 
the original default proof. 

We address this issue by furnishing a frame for a default proof $(\delta_i)_{i \in I}$. Such 
a frame guarantees that $(\delta_i)_{i \in I}$ is <-priority preserving; it is simply a set of 
<-preferred yet coherent default rules and default rules providing a blocking proof 
for <-preferred yet incoherent default rules. The overall construction is finally 
constrained by the consistency of the initial set of facts with the default rules in 
the original default proof and its frame. 

Consider a default rule $\delta_i \in \Delta$. Define its prior default rules as the set $P_i$ of 
default rules such that $P_i = \{\delta_j \in \Delta \mid \delta_j < \delta_i\}$. This notation extends to an 
entire default proof $(\delta_i)_{i \in I}$ in the obvious way: $P_I = \{\delta_j \in \Delta \mid \delta_j < \delta_i, i \in I\}$. 

Given a default proof $(\delta_i)_{i \in I}$ and the set of its prior default rules $P_I$, we must 
distinguish in $P_I$ between rules that actually conflict with the default proof at 
hand, and those that are compatible with it. While the former must be blocked 
so that the rules in $(\delta_i)_{i \in I}$ apply, the latter can apply without any harm. Since 
these compatible prior default rules furnish a "frame" for $(\delta_i)_{i \in I}$, we refer to 
them as a primary frame named $F'$. 

Formally, a primary frame $F'$ for a default proof $(\delta_i)_{i \in I}$ is a subset of $P_I$ such 
that $W \cup \{Conseq(\delta_i) \mid i \in I \cup \iota(F') \cup \iota(F'')\} \nvdash \bot$, where $F''$ is what we call a 
secondary frame, provided by default rules that allow us to block the conflicting 
prior default rules. This is elaborated upon in the sequel and we say that $F''$ 
contains blocking proofs. 

Intuitively, a blocking proof for $\delta$ is a sequence of default rules such that the 
joint application of its rules denies the application of $\delta$. Such a blocking proof 
provides a candidate for disabling the putatively applicable prior default rule $\delta$ 
that conflicts with the original default proof. In order to become effective, for a 
prior default rule $\delta \in (P_I \setminus F')$ such that $\delta < \delta_i$, we define the set of blocking 
proofs $\mathcal{B}_{(\Delta,<)}(\delta)$ as the set of all prioritized default proofs for $\neg Conseq(\delta)$ from 
$(\Delta, W, <)$. 

This leads us to the following definition of blocking proofs, which is 
inspired by [5]. 

Definition 7. Let $(\Delta, W, <)$ be a prioritized normal default theory. Let $\delta$ be a 
default. We define the set of blocking proofs $\mathcal{B}_{(\Delta,<)}(\delta)$ as the set of all prioritized 
default proofs for $\neg Conseq(\delta)$ from $(\Delta, W, <)$. If $B \subseteq \Delta$, then $B \in \mathcal{B}_{(\Delta,<)}(\delta)$ 
iff B satisfies the following conditions. 

1. $W \cup Conseq(B) \vdash \neg Conseq(\delta)$, 

2. B is grounded in W, 

3. $W \cup Conseq(B) \nvdash \bot$, 

4. B is <-priority preserving (wrt W). 

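For further illustration, consider our initial example in (2) along with its blocking 
proofs given in (4): 

$$\mathcal{B}_{(\Delta,<)}\!\left(\tfrac{p : \neg f}{\neg f}\right) = \emptyset, \quad \mathcal{B}_{(\Delta,<)}\!\left(\tfrac{p : b}{b}\right) = \emptyset, \quad \mathcal{B}_{(\Delta,<)}\!\left(\tfrac{b : f}{f}\right) = \left\{ \left( \tfrac{p : \neg f}{\neg f} \right) \right\} \qquad (4)$$

For example, $\left(\frac{p : \neg f}{\neg f}\right)$ is the only blocking proof for $\frac{b : f}{f}$ because it is a possible 
prioritized default proof for $\neg Conseq\!\left(\frac{b : f}{f}\right)$. But $\left(\frac{p : b}{b}, \frac{b : f}{f}\right)$ does not constitute 
a blocking proof for the default $\frac{p : \neg f}{\neg f}$ because the sequence $\left(\frac{p : b}{b}, \frac{b : f}{f}\right)$ is not 
<-priority preserving. 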

Now, given the definition of blocking proofs, we can define the secondary 
frame. For a default rule $\delta_i \in \Delta$ and a given primary frame $F'$, we define $\mathcal{S}_i$ as 
the relation obtained by taking the Cartesian product of all default proofs in 
$B_j = \mathcal{B}_{(\Delta,<)}(\delta_j) \cup \emptyset$ for $\delta_j \in (P_i \setminus F')$, that is, $\mathcal{S}_i = \times_{\delta_j \in (P_i \setminus F')} B_j$. We call $\mathcal{S}_i$ a 
local support structure for $\delta_i$ under $F'$. We can see in Figure 1 that if S is an 
element of $\mathcal{S}_i$, then $\Delta(S)$ is a set of defaults that blocks all defaults 
that are in $P_i \setminus F'$. 

Fig. 1. Local support structure. (Labels: defaults in $P_i \setminus F'$; sets of blocking 
proofs; $S = (\mathit{proof}, \dots, \mathit{proof}) \in \mathcal{S}_i$.) 

We extend this to entire default proofs in the following way. For a default 
proof $(\delta_i)_{i \in I}$ and one of its primary frames $F'$, we define $\mathcal{S}_I$ by taking the 
Cartesian product of all local support structures $\mathcal{S}_i$ under $F'$, that is, $\mathcal{S}_I = \times_{i \in I} \mathcal{S}_i$. 
At this point, a tuple of local supports $F'' \in \mathcal{S}_I$ is then called a secondary frame 
for a default proof $(\delta_i)_{i \in I}$ under $F'$. Thus, the set of defaults $\Delta(F'')$ denies all 
defaults that have priority over defaults in the original default proof and that 
have to be blocked since they are incompatible with this proof. 
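
The Cartesian-product construction can be sketched with `itertools.product`. Here a blocking proof is represented as a tuple of default names, and each conflicting prior default contributes one collection of blocking proofs. The names `local_support`, `frame_defaults` and the sample defaults `d1` to `d4` are hypothetical.

```python
from itertools import product


def local_support(blocking_sets):
    """All ways of choosing one blocking proof per conflicting prior default."""
    return list(product(*blocking_sets))


def frame_defaults(support):
    """Defaults involved in one element of a support structure."""
    return {d for proof in support for d in proof}


# Two conflicting prior defaults: the first has two candidate blocking proofs,
# the second has a single one (illustrative names only).
blocking_sets = [[("d1",), ("d2", "d3")], [("d4",)]]
supports = local_support(blocking_sets)
for support in supports:
    print(support, frame_defaults(support))
```

Two boundary cases are worth noting in this sketch: if some prior default has no blocking proof at all, the product is empty and no secondary frame exists; if there is nothing to block, the product contains exactly one empty tuple.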

This leads us to the following result. 

Theorem 3. Let $(\Delta, W, <)$ be a prioritized normal default theory and $\varphi$ a 
formula. Let $(\delta_i)_{i \in I}$ be a classical default proof from $(\Delta, W)$. A finite sequence 
$(\delta_i)_{i \in I \cup \iota(F') \cup \iota(F'')}$ for some primary frame $F'$ for $(\delta_i)_{i \in I}$ and some secondary 
frame $F''$ for $(\delta_i)_{i \in I}$ under $F'$ is a prioritized default proof for $\varphi$ from $(\Delta, W, <)$ 
iff 

$W \cup \{Conseq(\delta_i) \mid i \in I\} \vdash \varphi$ 

$W \cup Conseq(\{\delta_0, \dots, \delta_{i-1}\}) \vdash Prereq(\delta_i)$ 

$W \cup \{Conseq(\delta_i) \mid i \in I \cup \iota(F') \cup \iota(F'')\} \nvdash \bot$ 

$(\delta_i)_{i \in I \cup \iota(F') \cup \iota(F'')}$ is <-priority preserving 

This result gives a new characterization of a prioritized default proof. It 
ensures that every prioritized default proof can be found by combining a given 
classical default proof, which does not take the priorities into account, with two 
special sets of defaults (primary and secondary frames). The primary frame can 
immediately be computed from the set $P_I$ and the secondary frame can be found 
in a set of precompiled prioritized default proofs of $\neg Conseq(\delta)$ for all $\delta \in \Delta$. 

Example 1. Let us consider the following default theory. 

$$W = \{\neg d \vee \neg b\}$$

$$\Delta = \left\{ \frac{: a}{a},\ \frac{: c}{c},\ \frac{c : d}{d},\ \frac{a : b}{b},\ \frac{: \neg d}{\neg d} \right\}$$

$$< \;=\; \left\{ \frac{c : d}{d} < \frac{: a}{a},\ \ \frac{: c}{c} < \frac{a : b}{b} \right\}$$

Let $\varphi = b$. Figure 2 illustrates the major steps of the query computation from this 
theory. A classical default proof for b is $P = \left(\frac{: a}{a}, \frac{a : b}{b}\right)$ and its associated set 
of prior default rules is $P_I = \left\{\frac{c : d}{d}, \frac{: c}{c}\right\}$. The default $\frac{c : d}{d}$ is not consistent with 
P but it does not matter because at this point its prerequisite is not deducible. 
The default $\frac{: c}{c}$ must be introduced in P before $\frac{a : b}{b}$. So, we can try to combine 
P with $F' = \left\{\frac{: c}{c}\right\}$ and $F'' = \emptyset$^4. But that is not correct since the condition of 
<-priority preservation is not satisfied, because $\frac{c : d}{d}$ becomes active and it has to 
be introduced in P before $\frac{: a}{a}$. Furthermore, $\frac{c : d}{d}$ is not compatible with P since 
$W \cup Conseq(P) \cup Conseq\!\left(\frac{c : d}{d}\right) \vdash \bot$. Then, we have to look for a blocking proof 
for this default. The sequence $\left(\frac{: \neg d}{\neg d}\right)$ is a possible one. So that is why the local 
support structure for the default $\frac{: a}{a}$ is $\left\{\left(\frac{: \neg d}{\neg d}\right)\right\}$. As a result, the primary frame $F'$ 
for query b is the singleton set $\left\{\frac{: c}{c}\right\}$ and the secondary frame $F''$ is the set $\left\{\frac{: \neg d}{\neg d}\right\}$ 
because $W \cup \left\{Conseq\!\left(\frac{: a}{a}\right), Conseq\!\left(\frac{a : b}{b}\right)\right\} \cup \left\{Conseq\!\left(\frac{: c}{c}\right), Conseq\!\left(\frac{: \neg d}{\neg d}\right)\right\} \nvdash \bot$. 
Finally, query b is proved from the theory $(\Delta, W, <)$ because the resulting default 
proof, built from the defaults $\frac{: \neg d}{\neg d}$, $\frac{: c}{c}$, $\frac{: a}{a}$ and $\frac{a : b}{b}$, is <-priority preserving and 
thus it is a prioritized default proof. 

Fig. 2. Query computing tree from prioritized default theory in example 1 

^4 In order to avoid heavy notation, we identify the primary and secondary frames with 
sets of defaults. 

Query-Answering in Prioritized Default Logic 

4 Conclusion 

In this work we have presented a new characterization of default proofs in the 
framework of prioritized normal default logic. We have developed a methodology 
for query answering in the context of prioritized default theories which preserves 
priorities. This approach aims at a purely local treatment of query answering: the 
resulting prioritized default proof contains only the default rules necessary for 
proving the given query, which helps to minimize the computational effort. 
Furthermore, this work can easily be extended to justified default 
logic [6]. Further work is envisaged to implement this treatment of priorities in 
our query answering system XRay [11]. 

Abstract. This paper deals with knowledge representation and reasoning in 
directed belief networks. These networks are similar to those defined by Pearl 
(causal networks), but instead of probability functions, we use belief functions. 
Based on the work of Cano et al. [1992], who presented an 
axiomatic framework for propagating valuations in directed acyclic graphs using 
Shafer-Shenoy's axioms of valuation-based systems (VBS), we show how the 
Dempster-Shafer theory fits in this framework. Then, we present a propagation 
algorithm for directed belief networks that extends Pearl's algorithm 
but is expressed in terms of belief functions. 



1 Introduction 

In Artificial Intelligence, many applications of belief function models have used ad hoc 
methods for representing and manipulating belief functions. These methods paid no 
attention to the independence of the information being used. As an alternative to these 
methods, many researchers proposed to represent the structure of the model with a 
graph. The distinct advantage of this graphical representation is that it provides a 
picture showing how to represent knowledge about a problem, including its relations. 

There are two types of graphical models commonly in use: undirected graphs and 
directed graphs. The difference between them lies in the interpretation of the semantics 
of the edges. Whereas an undirected graph represents interactions (edges connect 
nodes that interact), a directed graph represents conditional relationships (the direction 
of edges shows causal influence). As the direction of conditioning often agrees with 
the flow of causality, the use of directed graphs is more appropriate. 

The concept of conditional independence relationships has been widely studied in 
probability theory (e.g. Dawid [1979], Pearl [1988], and Lauritzen et al. [1990]). 
Pearl starts with the conditional independence relationships when building his causal 
networks, where conditional probabilities can be directly manipulated using Bayes' 
theorem, and presents a propagation algorithm for these networks. However, in 
Dempster-Shafer theory, the concept of conditional independence has not been treated 
in depth. In networks using belief functions, the relations among the variables are 
generally represented by joint belief functions rather than conditional belief functions. 
Recently, a number of techniques for handling conditional belief functions have been 
developed. Cano et al. [1992] have presented an axiomatic system for propagating 
uncertainty (including belief functions) in directed acyclic graphs. Smets [1993] has 
generalized Bayes' theorem to the case of belief functions and presented the 
disjunctive rule of combination for two distinct pieces of evidence, which makes it 
possible to represent knowledge by conditional belief functions and to use them 
directly for reasoning in belief networks. Shenoy [1993] has explored the use of 
graphical representations of valuation-based systems (VBS), called valuation 
networks, for representing conditional relations. Xu and Smets [1996] have presented 
an alternative framework to the general VBS, called evidential network with 
conditional belief functions (ENC), and have proposed a propagation algorithm for 
such networks having only binary relations between the variables. 

In this paper, we are concerned with directed belief networks in which uncertainty is 
expressed in the form of belief functions. For this purpose, we adopt Pearl's structure 
(directed acyclic graph), but instead of probability functions, we use belief functions. 
In order to evaluate these belief networks, we apply the disjunctive rule of 
combination and the generalized Bayesian theorem (Smets [1993]). Based on the 
axiomatic framework, called valuation-based system, proposed by Shafer and Shenoy 
[1988] for undirected graphs (hypergraphs) and extended by Cano et al. [1992] to 
directed graphs (DAGs), we can propagate belief functions in the network using local 
computation techniques. 

The remainder of this paper is organized as follows. In Section 2, we recall the 
necessary background material and we give the correspondence between belief 
function representation and valuation-based systems. In Section 3, we present the local 
computation axioms for the propagation of belief functions and we recall Smets's two 
rules. Section 4 shows how to represent belief functions in directed acyclic 
graphs (DAGs) and how to compute marginals. Finally, the propagation algorithm 
applied to a belief function network is presented in Section 5. 



2 Background Material 

A valuation-based system (VBS) is a framework for managing uncertainty in expert
systems. In a VBS, knowledge is represented by functions called valuations, which can
be considered as the mathematical representation of pieces of information. Inference
is made by two operations, called combination and marginalization, that operate on
valuations. Combination corresponds to aggregation of knowledge. Marginalization
corresponds to coarsening of knowledge.

In this paper, we focus on belief-function theory and show how
this theory fits in the framework of VBS. For this purpose, valuation, combination
and marginalization will be translated to their interpretation in belief-function
theory.

2.1 Valuation-Based System Framework

The framework of VBS was first developed by Shenoy [1989] as a unified language
for uncertainty representation and reasoning in expert systems. It is general enough to
model different uncertainty theories, such as probability theory, possibility theory, and
belief-function theory. In this section, we recall the formal definitions of the basic
concepts of a VBS.

Variables. Consider a finite set $X = \{X_1, \ldots, X_n\}$ of variables. Each variable $X_i$ may
range over a finite set of possible values $\Theta_{X_i}$, or simply $\Theta_i$, called the frame for $X_i$.
Given a non-empty set $X_I$ of variables, where $I \subseteq \{1, \ldots, n\}$, we denote by $\Theta_I$ the
Cartesian product of the frames of the variables, i.e. $\Theta_I = \times\{\Theta_i \mid i \in I\}$. The elements
of $\Theta_I$ are called configurations of $X_I$.

Valuations. Given a set of variables $X_I$, for each $I \subseteq \{1, \ldots, n\}$ there is a set $V_I$. The
elements of $V_I$ are called valuations on $\Theta_I$. Intuitively, a valuation is the primitive
object that represents uncertainty about a set of variables.

Combination. If $V_I$ and $V_J$ are two valuations on $\Theta_I$ and $\Theta_J$, respectively, then $V_I \otimes V_J$
is a valuation on $\Theta_I \times \Theta_J$. Intuitively, combination is an operation that aggregates the
information of two valuations into a single valuation.

Marginalization. If $V$ is a valuation on $\Theta_I$ and $J \subseteq I$, then $V^{\downarrow J}$ is a valuation on $\Theta_J$.
Intuitively, marginalization is an operation that narrows the focus of a valuation.



2.2 Belief Function Representation 

The theory of belief functions, also called Dempster-Shafer theory, aims to model 
someone’s beliefs. The theory was first developed by Shafer [1976]. It is regarded as a 
generalization of probability theory (Bayesian approach). 

Definition 1. A belief function model is defined by a set $\Theta$, called a frame of
discernment, and a basic probability assignment (b.p.a) function. A b.p.a is a function
$m: 2^{\Theta} \to [0,1]$ that assigns to every subset $A$ of $\Theta$ a number $m(A)$ such that: $m(A) \geq 0$ for
all $A \subseteq \Theta$, $m(\emptyset) = 0$, and $\sum\{m(A) \mid A \subseteq \Theta\} = 1$.

Definition 2. Let $\Theta$ be the frame of discernment. The mapping $bel: 2^{\Theta} \to [0,1]$ is a
belief function iff there exists a basic probability assignment $m$ such that, for every subset $A$ of $\Theta$,
$bel(A) = \sum\{m(B) \mid B \subseteq A\}$.
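
Definitions 1 and 2 can be illustrated with a minimal Python sketch. The dictionary representation (frozenset keys, float masses), the helper names and the three-element frame are assumptions made for illustration only; they are not part of the paper.

```python
from itertools import combinations

def powerset(frame):
    """All subsets of `frame` as frozensets (including the empty set)."""
    items = list(frame)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_bpa(m, frame, tol=1e-9):
    """Check Definition 1: masses in [0, 1], m(empty) = 0, total mass 1."""
    frame = frozenset(frame)
    return (all(A <= frame and 0.0 <= v <= 1.0 for A, v in m.items())
            and m.get(frozenset(), 0.0) == 0.0
            and abs(sum(m.values()) - 1.0) <= tol)

def bel(m, A):
    """Definition 2: bel(A) is the total mass of the subsets of A."""
    A = frozenset(A)
    return sum(v for B, v in m.items() if B <= A)

# Hypothetical frame and b.p.a. (only focal elements are stored).
frame = {"a", "b", "c"}
m = {frozenset({"a"}): 0.5, frozenset({"a", "b"}): 0.3, frozenset(frame): 0.2}
```

With this b.p.a., `bel(m, {"a", "b"})` adds the masses of {a} and {a, b}, while `bel(m, {"b"})` is zero because no focal element is contained in {b}.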

Interpretation. The number $m(A)$ measures the degree of belief that is exactly
committed to $A$. Due to the lack of information, $m(A)$ cannot support any more
specific event. The value $bel(A)$ quantifies the strength of the belief that the event $A$
occurs. A subset $A$ of $\Theta$ such that $m(A) > 0$ is called a focal element of $bel$. $bel$ is
vacuous if the only focal element is $\Theta$. In the Dempster-Shafer presentation, we assume
that one and only one element of $\Theta$ is true (closed-world). However, in Smets's
definition, we accept that none of the elements could be true (open-world), so $m(\emptyset)$
can be positive. In this case, $bel$ is called an unnormalized belief function.

The correspondence between belief-function theory and the VBS framework can be
expressed as follows:

Valuations. Valuation is a basic probability assignment. 

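Combination. To combine independent pieces of evidence, we use Dempster's rule of
combination. Suppose that $bel_1$ and $bel_2$ are two belief functions over a frame of
discernment $\Theta$, and let $m_1$ and $m_2$ be their respective b.p.a functions. The direct sum of
$bel_1$ and $bel_2$, denoted by $bel = bel_1 \oplus bel_2$, is given by its b.p.a function $m$ ($m_1 \oplus m_2$):

$$m(A) = k \sum\{m_1(A_i)\, m_2(A_j) \mid A_i \subseteq \Theta,\ A_j \subseteq \Theta,\ A_i \cap A_j = A\} \quad \text{for all } \emptyset \neq A \subseteq \Theta,$$

where $k^{-1} = 1 - \sum\{m_1(A_i)\, m_2(A_j) \mid A_i \subseteq \Theta,\ A_j \subseteq \Theta,\ A_i \cap A_j = \emptyset\}$; $k$ is a normalization factor.

Marginalization. If $J \subseteq I$ and $m$ is a basic probability assignment on $\Theta_I$, then the
marginal of $m$ for $J$ is the basic probability assignment on $\Theta_J$ defined by:

$$m^{\downarrow J}(A) = \sum\{m(B) \mid B \subseteq \Theta_I \text{ such that } B^{\downarrow J} = A\} \quad \text{for all } A \subseteq \Theta_J.$$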
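
The two operations above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the dictionary representation, the function names and all numeric values are assumptions, and configurations are modelled as tuples so that marginalization is a projection onto chosen positions.

```python
def dempster(m1, m2):
    """Normalized Dempster combination of two b.p.a.s on the same frame."""
    raw, conflict = {}, 0.0
    for A, v in m1.items():
        for B, w in m2.items():
            C = A & B
            if C:
                raw[C] = raw.get(C, 0.0) + v * w
            else:
                conflict += v * w          # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    k = 1.0 / (1.0 - conflict)             # normalization factor k
    return {C: k * v for C, v in raw.items()}

def marginalize(m, keep):
    """Project each focal set of configurations onto the positions in `keep`."""
    out = {}
    for B, v in m.items():
        A = frozenset(tuple(cfg[i] for i in keep) for cfg in B)
        out[A] = out.get(A, 0.0) + v
    return out

# Hypothetical evidence on a binary frame {b, nb}.
theta = frozenset({"b", "nb"})
m1 = {frozenset({"b"}): 0.6, theta: 0.4}
m2 = {frozenset({"nb"}): 0.5, theta: 0.5}
combined = dempster(m1, m2)

# Hypothetical joint b.p.a. on Theta_X x Theta_Y; every focal set covers both x-values.
joint = {
    frozenset({("x1", "y1"), ("x2", "y2")}): 0.7,
    frozenset({("x1", "y2"), ("x2", "y1")}): 0.2,
    frozenset({("x1", "y1"), ("x1", "y2"), ("x2", "y1"), ("x2", "y2")}): 0.1,
}
marginal_x = marginalize(joint, keep=(0,))
```

Here the conflicting mass is 0.6 × 0.5 = 0.3, so the remaining products are divided by 0.7. The marginal of `joint` on the first variable puts all its mass on the whole frame of X, i.e. it is vacuous.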



3 Axiomatic Framework for the Propagation Process 

Given an evidential system defined by a set of variables $X = \{X_1, \ldots, X_n\}$ and
valuations expressed by belief functions $bel_1, \ldots, bel_n$, we make inference by
computing, for each variable $X_i$, the marginal of the joint belief function $(bel_1 \oplus \cdots \oplus bel_n)^{\downarrow \{X_i\}}$.
However, when there are many variables to manipulate, the computation of the joint
belief function becomes infeasible since the joint frame of the variables is too large.

Many researchers have studied this problem in order to propose techniques for 
computing marginals without explicitly calculating the global belief function. A well- 
known method is the local computation among the initial belief functions in Markov 
trees (Shafer et al. [1987], Shenoy and Shafer [1990]). The idea proposed by Shenoy
and Shafer [1990] is that if the combination and marginalization operations satisfy three
axioms, then the computations can be done locally.

In this section, we show how to do computations with belief functions using local 
computation techniques in undirected and directed graphs. Then, we give a method 
analogous to Bayes theorem proposed by Smets [1993] in order to use it in the 
propagation process. 



3.1 Axioms for Local Computation in Undirected Graphs 

Shenoy and Shafer [1990] propose a set of axioms under which exact local computation
of marginals is possible. These axioms provide necessary conditions for the
development of a propagation algorithm in hypergraphs. They are stated here for basic
probability assignments.

Axiom A1. (Commutativity and associativity of combination)

$m_1 \oplus m_2 = m_2 \oplus m_1$ and $(m_1 \oplus m_2) \oplus m_3 = m_1 \oplus (m_2 \oplus m_3)$.

Axiom A2. (Consonance of marginalization)

If $I \subseteq J \subseteq K$ and $m$ is a basic probability assignment on $\Theta_K$, then $(m^{\downarrow J})^{\downarrow I} = m^{\downarrow I}$.

Axiom A3. (Distributivity of marginalization over combination)

Let $m_1$ and $m_2$ be two b.p.a functions on $\Theta_I$ and $\Theta_J$, respectively; then

$$(m_1 \oplus m_2)^{\downarrow I} = m_1 \oplus m_2^{\downarrow (I \cap J)}.$$

Interpretation. Axiom A2 tells us that the order in which the variables are deleted
does not matter. Axiom A3 is of particular importance to the development of a
propagation algorithm in hypergraphs. Indeed, it allows us to compute $(m_1 \oplus m_2)^{\downarrow I}$
without explicitly calculating $(m_1 \oplus m_2)$.



3.2 Axioms for Local Computation in Directed Graphs 

Three new axioms are added by Cano et al. [1992] to Shenoy-Shafer’s axiomatic 
framework, for the propagation in directed graphs. These axioms allow the definition 
of some aspects related with directed graphs. 

Axiom A4. (Neutral Element)

There exists one and only one b.p.a $m_0$ defined on $\Theta_1 \times \cdots \times \Theta_n$ such that for every
b.p.a $m$ defined on $\Theta_I$ and every $J \subseteq I$, $m_0^{\downarrow J} \oplus m = m$.

Axiom A5. (Contradiction)

There exists one and only one b.p.a $m_c$ defined on $\Theta_1 \times \cdots \times \Theta_n$ such that for every
b.p.a $m$, $m_c \oplus m = m_c$.

Axiom A6.

For every $m$ defined on the frame corresponding to the empty set of variables, if
$m \neq m_c^{\downarrow \emptyset}$ then $m = m_0^{\downarrow \emptyset}$.

Interpretation. The neutral element is the vacuous belief function defined by:
$m_0(A) = 1$ if $A = \Theta_1 \times \cdots \times \Theta_n$ and $m_0(A) = 0$ otherwise.

The contradiction is a belief such that if it is combined with any other belief, it
produces the contradiction. When we do not normalize, the contradiction is the zero-
valued mass assignment defined by: $m_c(A) = 0$ for all $A \subseteq \Theta_1 \times \cdots \times \Theta_n$. The meaning of
axiom A6 is related to the concept of conditional independence (see Cano et al. [1992]).

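Definition 3. (Conditional belief function)

A conditional belief function on $\Theta_j$ given $\Theta_i$ is defined as a belief function on $\Theta_i \times \Theta_j$
such that marginalizing it on $\Theta_i$ gives the neutral element.

Example 1. A conditional belief function on $\Theta_2$ given $\Theta_1$ is given by a basic probability
assignment on $\Theta_1 \times \Theta_2$ such that if it is marginalized on $\Theta_1$, then we get the neutral
element, as in this example with three focal elements:

$$m(A_1) = 0.7, \qquad m(A_2) = 0.2, \qquad m(\Theta_1 \times \Theta_2) = 0.1 \quad \Longrightarrow \quad m^{\downarrow 1}(\Theta_1) = 1,$$

where $A_1$ and $A_2$ are two-element sets of configurations whose projection on $\Theta_1$ is $\Theta_1$.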

Definition 4. (Absorbent belief function)

An absorbent belief function represents perfect information, i.e. $m(A) = 1$
where $A$ is a singleton.

An example of an absorbent belief function arises when we combine two contradictory
belief functions.

3.3 The Generalized Bayesian Theorem



Let us consider two spaces $\Theta_X$ and $\Theta_Y$. We suppose that the additional information is
given by $bel_Y(\cdot \mid x)$, representing the conditional belief function induced on the space $\Theta_Y$
given an element $x$ of $\Theta_X$. We want to compute $bel_X(x \mid y)$ for any $x \subseteq \Theta_X$ and $y \subseteq \Theta_Y$. This
belief is derived from the generalized Bayesian theorem (GBT).

For a given belief function $bel: 2^{\Theta} \to [0,1]$, Smets [1993] defines a function $b: 2^{\Theta} \to [0,1]$
such that $b(A) = bel(A) + m(\emptyset)$.

For any $x \subseteq \Theta_X$ and $y \subseteq \Theta_Y$, the GBT allows us to build the belief function $bel_X(x \mid y)$ by:

$$bel_X(x \mid y) = b_X(x \mid y) - b_X(\emptyset \mid y), \qquad b_X(x \mid y) = \prod_{x_i \in \bar{x}} b_Y(\bar{y} \mid x_i) \qquad (1)$$

If each $bel_Y(\cdot \mid x_i)$ happens to be a probability function $P(\cdot \mid x_i)$ on $\Theta_Y$ and the prior
belief on $\Theta_X$ is also a probability function $P_0(x)$, then the normalized GBT reduces
to Bayes' theorem:

$$m(x_i \mid y) = \frac{P_0(x_i)\, P(y \mid x_i)}{\sum_{x_j \in \Theta_X} P_0(x_j)\, P(y \mid x_j)} = P(x_i \mid y) \qquad (2)$$



Likewise, if we want to compute $bel_Y(y \mid x)$ for any $x \subseteq \Theta_X$ and $y \subseteq \Theta_Y$, we use
another rule proposed by Smets [1993], called the disjunctive rule of combination
(DRC). Indeed, the DRC allows us to build the belief function $bel_Y(y \mid x)$ by:

$$bel_Y(y \mid x) = b_Y(y \mid x) - b_Y(\emptyset \mid x), \qquad b_Y(y \mid x) = \prod_{x_i \in x} b_Y(y \mid x_i) \qquad (3)$$
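
Formulas (1) and (3) can be sketched in Python as follows. This is a hedged illustration, not the authors' code: it assumes that the product in (1) runs over the complement of x and uses the complement of y, that each conditional belief function is stored as a b.p.a. dictionary `cond[x_i]`, and the function names and numbers are hypothetical.

```python
def b(m, A):
    """Smets's b function: b(A) = bel(A) + m(empty), i.e. the mass of all subsets of A."""
    A = frozenset(A)
    return sum(v for B, v in m.items() if B <= A)

def drc_bel(cond, x, y):
    """bel_Y(y | x) by the disjunctive rule of combination, formula (3)."""
    b_y = b_empty = 1.0
    for xi in x:
        b_y *= b(cond[xi], y)
        b_empty *= b(cond[xi], frozenset())
    return b_y - b_empty

def gbt_bel(cond, frame_x, frame_y, x, y):
    """Unnormalized bel_X(x | y) by the generalized Bayesian theorem, formula (1)."""
    not_y = frozenset(frame_y) - frozenset(y)
    b_x = b_empty = 1.0
    for xi in frame_x:
        term = b(cond[xi], not_y)
        b_empty *= term        # b_X(empty | y): product over the whole frame of X
        if xi not in x:
            b_x *= term        # b_X(x | y): product over the complement of x
    return b_x - b_empty

# Hypothetical conditional beliefs that happen to be probability functions.
frame_x, frame_y = {"x1", "x2"}, {"y1", "y2"}
cond = {
    "x1": {frozenset({"y1"}): 0.8, frozenset({"y2"}): 0.2},
    "x2": {frozenset({"y1"}): 0.3, frozenset({"y2"}): 0.7},
}
posterior = gbt_bel(cond, frame_x, frame_y, {"x1"}, {"y1"})
forward = drc_bel(cond, {"x1", "x2"}, {"y1"})
```

In this probabilistic special case, `posterior` equals 0.7 − 0.2 × 0.7 and `forward` equals 0.8 × 0.3, which makes the product structure of both rules easy to check by hand.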



4 Belief Functions and Directed Acyclic Graphs 

Belief networks described by Shafer et al. [1987] are undirected hypergraphs where 
hyper-nodes represent sets of variables and hyper-edges are weighted with belief 
functions on the product space of the variables. In order to evaluate these belief 
networks, Shafer et al. [1987] have proposed a message-passing scheme for 
propagating belief functions using local computation. 

In Pearl's approach (using probability functions) [1988], the edges are directed and
weighted by the conditional probabilities of the child node given the parent nodes.
The graphical structures used to represent relationships among variables are causal
networks, i.e. directed acyclic graphs (DAGs) in which the vertices represent
variables, the arcs show the existence of direct influence between the variables, and
the strengths are expressed by conditional probabilities.

As in our previous work (Ben Yaghlane et al. [1998], Khalfallah et al. [1998]), we
adopt Pearl's structure, but instead of probability functions, we use directed belief
networks that are weighted by conditional belief functions. Indeed, consider the
simplest directed belief network with two nodes X and Y defined on frames $\Theta_X$ and
$\Theta_Y$, respectively. Suppose that there exists some a priori information over $\Theta_X$ given by
the belief function $bel_{0X}$ ($m_{0X}$ is the corresponding mass function) and some a priori
information over $\Theta_Y$ given by the belief function $bel_{0Y}$ ($m_{0Y}$). We assume that we also have
conditional belief functions $\{bel_Y(\cdot \mid x_i) : x_i \in \Theta_X\}$.

For each node in the network, its marginal is computed by combining all the messages
received from its neighbors and its own prior belief. So, if we want to compute $bel_X$ of
the node X, which is the parent of Y, we combine its prior belief $bel_{0X}$ with the
message coming from Y, i.e.

$$bel_X = bel_{0X} \oplus bel_{Y \to X} \qquad (4)$$

where $bel_{Y \to X}$ is a belief function on X, computed by:

$$\forall x \subseteq \Theta_X, \quad bel_{Y \to X}(x) = \sum_{y \subseteq \Theta_Y} m_{0Y}(y)\, bel_X(x \mid y) \qquad (5)$$

such that $bel_X(x \mid y)$ is a posterior belief function given by the generalized Bayesian
theorem (Formula 1).

On the other hand, if we want to compute $bel_Y$ of the node Y, which is the child of X,
we combine its prior belief $bel_{0Y}$ with the message coming from X, i.e.

$$bel_Y = bel_{0Y} \oplus bel_{X \to Y} \qquad (6)$$

where $bel_{X \to Y}$ is a belief function on Y, computed by:

$$\forall y \subseteq \Theta_Y, \quad bel_{X \to Y}(y) = \sum_{x \subseteq \Theta_X} m_{0X}(x)\, bel_Y(y \mid x) \qquad (7)$$

such that $bel_Y(y \mid x)$ is a conditional belief function representing the relation between
X and Y, given by the disjunctive rule of combination (Formula 3).

This computation is similar to the one in the algorithm of Shafer et al. [1987],
but the propagation between nodes is faster because the storage at the edge is smaller
(with conditional belief functions, we store only $|\Theta_X| \times 2^{|\Theta_Y|}$ values in the worst case).



5 Propagation of Belief Functions in Directed Acyclic Graphs 

In this section, we propose a propagation algorithm for directed belief networks that
are quantified by conditional belief functions. The main idea of the proposed
algorithm is to extend Pearl's algorithm to belief functions and then to use the local
propagation process for the computation of marginal belief functions rather than the
calculation of global belief functions. As in Pearl's structure, we suppose that the
networks we are working with do not have loops.

For the needs of the propagation algorithm, let U be a finite set of nodes of a given
network. For each $X \in U$, let $P(X) \subseteq U$ be the set of parents of X, and $C(X) \subseteq U$ be the
set of children of X. For each node X, we store the available a priori belief function
$bel_{0X}$. We also store the conditional belief functions for each node given its parents,
$\{bel_X(\cdot \mid Y) : Y \in P(X)\}$. We assume that if a node X is instantiated then we have a new
observation ($O_X$); otherwise $O_X$ is the vacuous belief function.

As in Pearl's notation, each variable of the network has a $\lambda$ value and a $\pi$ value
associated with it. In addition, each variable passes a $\lambda$ message to each of its parents
and a $\pi$ message to each of its children. In the following section, we present the
propagation algorithm.



5.1 Initialization Process 

In this step, we show how to compute the a priori belief functions of each variable 
(node) of the network using propagation method. 

Procedure INIT

1. Set all $\lambda$ values to the vacuous belief function.

2. For all roots X,

   set $\pi_X = bel_{0X}$

   send a new $\pi_{X \to Y}$ message to all children Y of X using formula (7).

3. When a variable Y receives a new $\pi$ message from a parent, then compute

   the new $\pi_Y$ value using the formula $\pi_Y = bel_{0Y} \oplus \big(\bigoplus_{X \in P(Y)} \pi_{X \to Y}\big)$

   the marginal belief $bel_Y = \pi_Y \oplus \lambda_Y$

   send a new $\pi$ message to all its children using formula (7).



Example 2. Let us consider the following directed belief network (Fig. 1) consisting
of six nodes U = {A, B, C, D, E, F} representing the variables of the problem. For the sake
of computational simplicity, all the variables used in this example are binary.




Fig. 1. The state of the belief network before any variables are instantiated

Before any variables are instantiated, we perform the initialization step in which we
apply the propagation method described above.

Node A. (root) $\pi_A = m(A)$

- Compute and send the $\pi_{A \to B}$ message to B using formula (7). So, for any $b \subseteq \Theta_B$, the
belief function $bel_{A \to B}(b)$ is given by its b.p.a $m_{A \to B}(b)$ as follows:

$m_{A \to B}(b) = 0.4 \times 0.5 = 0.2 \qquad m_{A \to B}(\bar{b}) = 0.425 \qquad m_{A \to B}(\Theta_B) = 0.375$

Node E. As node E is also a root, we do the same thing for it.



Node B. When B receives a new $\pi$ message from its parent A

- Compute its $\pi_B$ value and its new marginal in this manner: $\pi_B = m(B)$, with
$m_B(b) = 0.2 \qquad m_B(\bar{b}) = 0.425 \qquad m_B(\Theta_B) = 0.375$

- Compute and send the $\pi_{B \to C}$ message to C using formula (7). So, for any $c \subseteq \Theta_C$, the
belief function $bel_{B \to C}(c)$ is given by its b.p.a $m_{B \to C}(c)$ as follows:

$m_{B \to C}(c) = 0.2 \times 0.3 = 0.06 \qquad m_{B \to C}(\bar{c}) = 0.34 \qquad m_{B \to C}(\Theta_C) = \ldots$

- Similarly, compute and send the $\pi_{B \to D}$ message to D using formula (7).



Node D. When D receives new $\pi$ messages from its parents B and E

- Compute its $\pi_D$ value and its new marginal: $\pi_D = \pi_{B \to D} \oplus \pi_{E \to D}$

After normalization, we obtain:

$\pi_D(d) = m_D(d) = 0.5442 / 0.8587 = 0.63$
$\pi_D(\bar{d}) = m_D(\bar{d}) = 0.1161 / 0.8587 = 0.14$
$\pi_D(\Theta_D) = m_D(\Theta_D) = 0.1984 / 0.8587 = 0.23$

where 0.8587 is the normalization constant (one minus the conflicting mass).

- Compute and send the $\pi_{D \to F}$ message to F using formula (7).
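
The normalization step for node D can be checked with a few lines of Python. The unnormalized masses are the ones quoted above; the dictionary keys are illustrative labels only.

```python
# Unnormalized masses of node D quoted in the example.
raw = {"d": 0.5442, "not d": 0.1161, "Theta_D": 0.1984}

# The normalization constant is the total non-conflicting mass.
total = sum(raw.values())

# Divide each mass by the constant and round to two decimals, as in the text.
normalized = {key: round(value / total, 2) for key, value in raw.items()}
```

The three raw masses add up to the constant 0.8587 used in the text, and the rounded quotients reproduce the reported values.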

Nodes C and F. Similarly, we do the same computations for nodes C and F.

Fig. 2. The state of the belief network after initialization

5.2 Updating Process

When a new observation (information) is introduced at a given node, the updating
algorithm is performed. The general idea of this algorithm is that when we arrive
at a node X, we compute all the incoming messages, and then we
calculate its $\pi$ value, its $\lambda$ value, its new marginal $bel_X$, and all outgoing messages.

Computing values

For each node X of U, we compute $bel_X$, the marginal
belief of node X. It is obtained by combining its initial value with the new observation
($O_X$) and with the messages coming from all its parents and all its children:

$$bel_X = \pi_X \oplus \lambda_X, \quad \text{where} \quad \pi_X = bel_{0X} \oplus \Big(\bigoplus_{Y \in P(X)} \pi_{Y \to X}\Big) \quad \text{and} \quad \lambda_X = O_X \oplus \Big(\bigoplus_{Z \in C(X)} \lambda_{Z \to X}\Big) \qquad (8)$$



Computing messages

Once the node X is updated, it sends new messages to all its neighbors that have not yet
been updated. These messages are computed as follows:

• $\pi_{X \to Y}$, representing the message sent from a node X to its child Y (formula 7):

$$\pi_{X \to Y} = bel_{X \to Y} \quad \text{where} \quad bel_{X \to Y}(y) = \sum_{x \subseteq \Theta_X} m_X(x)\, bel_Y(y \mid x) \qquad (9)$$

such that $bel_Y(y \mid x)$ is given by the disjunctive rule of combination (Formula 3).

• $\lambda_{Y \to X}$, representing the message sent from a node Y to its parent X (formula 5):

$$\lambda_{Y \to X} = bel_{Y \to X} \quad \text{where} \quad bel_{Y \to X}(x) = \sum_{y \subseteq \Theta_Y} m_Y(y)\, bel_X(x \mid y) \qquad (10)$$

such that $bel_X(x \mid y)$ is a posterior belief function given by the GBT (Formula 1).

Propagation algorithm

When a new observation ($O_X$) is introduced at node X, we propose the following
recursive algorithm to perform the updating:

• X computes its new value $bel_X$ using formula (8).

• For every child node $Z \in C(X)$ that has not yet been updated, we calculate and send
the new message $\pi_{X \to Z}$ using formula (9).

• For every parent node $Y \in P(X)$ that has not yet been updated, we compute and send
the new message $\lambda_{X \to Y}$ using formula (10).

• Then, we select a new node X of U.

This propagation algorithm ends when there are no nodes left to update.
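
The order in which messages are sent by this procedure can be sketched in Python. The sketch only schedules messages and does not compute belief functions; the breadth-first selection of the next node, the function name and the edge list (A → B, B → C, B → D, E → D, D → F, inferred from the walk-through of Example 2) are assumptions.

```python
from collections import deque

def propagation_schedule(parents, start):
    """Return the (sender, receiver, kind) messages sent after observing `start`."""
    children = {node: [] for node in parents}
    for node, node_parents in parents.items():
        for p in node_parents:
            children[p].append(node)

    updated, queue, messages = {start}, deque([start]), []
    while queue:
        x = queue.popleft()
        # pi messages go to children that have not been updated yet (formula 9).
        for z in children[x]:
            if z not in updated:
                messages.append((x, z, "pi"))
                updated.add(z)
                queue.append(z)
        # lambda messages go to parents that have not been updated yet (formula 10).
        for y in parents[x]:
            if y not in updated:
                messages.append((x, y, "lambda"))
                updated.add(y)
                queue.append(y)
    return messages

# Assumed structure of the network of Example 2 (a singly connected DAG).
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B", "E"], "E": [], "F": ["D"]}
schedule = propagation_schedule(parents, "D")
```

Because the network has no loops, every node other than the observed one receives exactly one message, so the schedule contains one entry per remaining node.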

Example 3. Suppose that we have the following simple belief network consisting of
the nodes A, B, and C. After performing initialization, the state of the network is as
follows:




Fig. 3. The state of the belief network after initialization 



Suppose now that a new observation is introduced at node A, where $O_A = (1, 0, 0)$.
So, node A is instantiated and, to update the belief network, the propagation
algorithm described above is performed.

Node A. A is instantiated

Compute the new values for the node A. A has no children.

Compute and send $\lambda_{A \to B}$ to its parent B using formula (10). For any $b \subseteq \Theta_B$, $bel_{A \to B}(b)$ is
given by its b.p.a: $m_{A \to B}(b) = 1 \times 0.2 = 0.2 \qquad m_{A \to B}(\bar{b}) = 0 \times 0.5 = 0 \qquad m_{A \to B}(\Theta_B) = 0.8$

Similarly, compute and send $\lambda_{A \to C}$ to C using formula (10).

Node B. When B receives a new $\lambda$ message from A

Compute its new marginal using formula (8).

After normalization, we obtain: $m_B(b) = 0.74 / 0.98 = 0.75 \qquad m_B(\bar{b}) = 0.08 \qquad m_B(\Theta_B) = 0.17$
where $1 - 0.02 = 0.98$ is the normalization constant.

The node B has no children (other than A) and no parent. 

Node C. We do the same thing as node B. 



The final state of the belief network is given in Fig. 4.




Fig. 4. The state of the belief network after A is instantiated








Boutheina Ben Yaghlane and Khaled Mellouli 



6 Conclusion 

In this paper, we have presented the propagation algorithm for directed acyclic belief 
networks that are weighted by conditional belief functions. Using the generalized 
Bayesian theorem, the proposed algorithm allows us to perform updating when new 
information is introduced in the network. The representation used for these networks 
is similar to Pearl's structure. This algorithm can be generalized by using other
graphical representations, such as valuation networks.

Abstract. A popular approach to explanations amounts to backward
chaining over logical implications encoding causal links. However, the
resulting explanations are often unsatisfactory from a common-sense point
of view. We define a framework allowing us to distinguish causal
implication from mere logical implication. Causal explanations are then deduced
through two inference schemes so that explaining is in some way “less
than implying” and “more than implying”. Finally, we show how our
approach applies to diagnostics.



1 Introduction 

Looking for explanations for facts or observations is a common-sense reasoning 
task which is central in AI, e.g. in learning, tutoring, revising knowledge or just 
arguing. It is also at the core of the diagnostics task which aims at explaining 
an observed behaviour. Abductive reasoning is the widely used logical approach 
in which explaining is defined merely as implying [14,12,1]. In the diagnostics 
field for example, abduction and then implication, is at the root of the abductive 
diagnoses as defined by [7] and of the explanatory diagnoses as defined by [2]. 

In this paper, we demonstrate that defining explaining as implying is not 
always satisfactory from a common-sense point of view. As remarked by Poole 
[15], a distinction between what is predicted to be true as opposed to explaining 
actual observations has to be made. This problem is particularly crucial when 
dealing with real applications as reported in [16,9]. 

Assume that your model predicts that flu manifests itself by a temperature
between 38 and 42 degrees. Usually, the causal link “if X then Y” is encoded as
$X \to Y$ where $\to$ denotes logical implication:

$$\mathit{flu} \to (\mathit{temp} = 38 \vee \cdots \vee \mathit{temp} = 42).$$

Assume further that you observe “temperature is 40”. This fact cannot be
explained by flu because flu does not imply $\mathit{temp} = 40$. Assume now that you
observe “temperature is greater than 36” instead. This fact is explained by flu
as flu implies $\mathit{temp} > 37$ in view of $\mathit{flu} \to (\mathit{temp} = 38 \vee \cdots \vee \mathit{temp} = 42)$. All
this is counter-intuitive but does not result from a problem of completeness or
correctness of the model. The effect of flu on temperature is perfectly described.
The problem lies in the definition of explaining, which fails to take into account the
respective level of accuracy between what can be expected and what is observed.
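
The flu example can be made concrete with a small Python sketch. It models a prediction and an observation as sets of admissible temperature values, which is an assumption made purely for illustration (as are the function name and the discretised scale); under “explaining as implying”, a cause explains an observation exactly when every predicted value satisfies the observation.

```python
def implied_by(predicted, observed):
    """'Explaining as implying': the cause entails the observation iff
    every value it predicts is compatible with what is observed."""
    return set(predicted) <= set(observed)

# Hypothetical discretised temperature scale.
scale = range(30, 46)

# flu -> (temp = 38 v ... v temp = 42)
flu_prediction = set(range(38, 43))

observation_exact = {40}                              # "temperature is 40"
observation_vague = {t for t in scale if t > 36}      # "temperature is greater than 36"

explains_exact = implied_by(flu_prediction, observation_exact)
explains_vague = implied_by(flu_prediction, observation_vague)
```

The precise observation is not entailed and so goes unexplained, while the vague one is entailed and counts as explained, which is the counter-intuitive behaviour described above.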

A related problem frequently occurs in real-world applications. For example,
you expect a step on the observed signal. In the “explaining as implying”
approach, observing a sharp step will not be explained but observing evolution on
the signal will be explained (provided that a step is known to be a special case
of evolution). Again, this is not correct.

We argue in this paper that explaining is both “more” than implying and
“less” than implying.¹ It is more than implying in that an observation has to
be precise enough with respect to what is expected in order to be regarded as explained.
Direct effects (e.g., $\mathit{temp} = 38 \vee \cdots \vee \mathit{temp} = 42$) of a cause are then formally
distinguished from its logical consequences (e.g., $\mathit{temp} > 37$). It is also less than
implying in that each element (e.g., $\mathit{temp} = 39$) of the set of expected values
(and their consequences) is regarded as explained even if not logically implied.

We propose a formal framework with definitions deriving from this common-sense
view of explaining. Because we are concerned with causal explanations
as opposed to explanations based on analogy for instance, we rely on a causal
model CM describing the links between causes and their effects and on a logical
theory W describing background knowledge (section 2). In sections 3 and 4,
explanations are formally defined. In section 5, we show how this framework can
be used for diagnostics. In section 6, we compare our proposal with related work
about abduction and diagnostics. Finally, possible extensions are discussed and
perspectives are sketched out.

2 Causal models 

Causal information often consists of a series of statements of the form “X causes
Y”. Accordingly, causal information is usually represented by means of causal
graphs where each statement “X causes Y” is encoded as an arc from X to
Y. Technically, X and Y can be sets of vertices $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_m\}$
so that the arc from X to Y actually means that $x_1, \ldots, x_n$ when
taken together cause one of $y_1, \ldots, y_m$ to happen.

We focus on such causal information in a logical setting, by applying the 
natural correspondence: simply giving vertices the form of atomic formulas and 
giving arcs the form of implications. Beside causal information, we also take 
background knowledge into account. 

As a consequence, we are faced with fundamentally different statements:
Obviously, “if flu then fever” (flu causes fever) is not of the same nature as “if flu
then disease” (flu is a disease). We resort to modalities in order to encode causal
implications and discriminate them from other (non-causal) implications to be
found in background knowledge. That is, we will have the following formulas:

$$\mathit{flu} \to \mathit{disease} \qquad \text{and} \qquad [C]\mathit{flu} \to [E]\mathit{fever}$$

where [C] indicates a cause and [E] indicates an effect.

¹ By “more than implying”, we mean more constrained when compared to inference
using classical implication.

We represent a causal statement “X causes Y” where $X = \{x_1, \ldots, x_n\}$ and
$Y = \{y_1, \ldots, y_m\}$ by a formula

$$[C](x_1 \wedge \ldots \wedge x_n) \to [E](y_1 \vee \ldots \vee y_m).$$

The semantics of such a formula is that $y_1 \vee \ldots \vee y_m$ form an exhaustive range
of alternative phenomena issued from $x_1 \wedge \ldots \wedge x_n$ (that is, the cause under
consideration). Indeed, “maximality” of $y_1 \vee \ldots \vee y_m$ is important and
$[E]\alpha \to [E](\alpha \vee \beta)$ does not hold.

Such formulas $[C](x_1 \wedge \ldots \wedge x_n) \to [E](y_1 \vee \ldots \vee y_m)$ make up the causal
model, denoted CM.

Background knowledge includes definitions, exclusions, and generally, all 
kinds of non-causal relations. Background knowledge, to be denoted by W, is 
then simply represented by non-modal formulas. 

Throughout the paper, we will use an example which comes from a real-world 
application but has been drastically simplified. The causal model describes a 
physical system in which a sliding of the flywheel (to be abbreviated as sof) 
causes a step in the vibration measurement signal, and in which any evolution 
in the vibration measurement signal causes an alarm to be displayed on the 
operator control screen, and in which a steering joint clamp (to be abbreviated 
as swc) causes a red light to be lit. All this is to be expressed as formulas in 
CM. 



$$CM = \{\ [C]\mathit{sof} \to [E]\mathit{step},\quad [C]\mathit{evolution} \to [E]\mathit{alarm},\quad [C]\mathit{swc} \to [E]\text{red-light}\ \}$$

It is also known that step and slow increase are two kinds of evolution of
the vibration measurement signal and that a sharp step is itself a kind of step.
Clearly, such background knowledge does not describe causal links and is given
in W.

$$W = \{\ \mathit{step} \to \mathit{evolution},\quad \text{slow-increase} \to \mathit{evolution},\quad \text{sharp-step} \to \mathit{step}\ \}$$

The contents of the example are depicted in figure 1 in which a simple arrow
indicates a non-causal link (W-link) and a double arrow indicates a causal link
(CM-link).

[Figure: graph over the nodes sof, evolution, alarm, swc, sharp-step, with CM-links and W-links]

Fig. 1. Graphical representation of the example 



3 Direct causes 

The logical language we consider is a propositional² one enriched with two
modalities³ [C] and [E]. Embedded modalities are not allowed.

Every formula $[C]\alpha \to [E]\beta$ in a causal model CM is such that $\beta$ is an
atom-disjunction (an atom-disjunction is a disjunction of atomic formulas).

The syntax for CM inherits the expressiveness constraints of causal graphs.
Yet, such limitations can easily be overcome: A causal link with conjunction in
the consequent “if X then Y and Z” can be represented by “if X then Y” and
“if X then Z”.

In any case, directions for extending CM to include more general formulas 
are discussed in section 7. In particular, we mention “can cause” links as well as 
uncertainty and disjunctions between causal links. 

Through the properties held by [C] and [E], we aim at defining an inference
scheme for explaining which is both “less than implying” and “more than
implying”. We introduce our approach in two steps, dealing only with “more than
implying” in this section while delaying presentation of the full-fledged system
to section 4.

The inference system $\vdash_1$ is a standard system $\vdash$, corresponding to all classical
(hence, non-modal) rules, supplemented with the following items.

² For the sake of brevity, we may use variables to range over finite sets of values (e.g.,
$\exists z\, \mathit{temp}(z)$ is viewed as shorthand notation for $\mathit{temp}(37) \vee \ldots \vee \mathit{temp}(42)$ and even
for $\mathit{temp} = 37 \vee \ldots \vee \mathit{temp} = 42$ by slight abuse).

³ We warn the reader about not confusing our propositional operators for cause and
effect with classical modalities as found in the literature on modal logics.

The modalities [C] and [E] enjoy simple inferences as expressed by the
following axioms:

$\vdash_1 [C](\alpha \vee \beta) \to [C]\alpha \vee [C]\beta$
$\vdash_1 [C](\alpha \wedge \beta) \to [C]\alpha \wedge [C]\beta$
$\vdash_1 [E]\alpha \wedge [E]\beta \leftrightarrow [E](\alpha \wedge \beta)$

As already mentioned, the intended meaning of formulas $[C](x_1 \wedge \ldots \wedge x_n) \to
[E](y_1 \vee \ldots \vee y_m)$ is that the set of alternative effects $\{y_1, \ldots, y_m\}$ is exhaustive.
Accordingly, $\nvdash_1 [E]\alpha \to [E](\alpha \vee \beta)$ and $\nvdash_1 [E]\alpha \vee [E]\beta \to [E](\alpha \vee \beta)$.

The modality [E] satisfies:

$$\vdash_1 [E]\alpha \wedge [E]\beta \to [E](\alpha \vee \beta) \qquad (0)$$

Deductions involving more than one causal link of CM need the rule:

$$\frac{[E]\alpha \qquad (W \wedge \alpha) \to \beta}{[C]\beta} \qquad (1)$$

Careful: The above rule means that the system $\vdash_1$ is implicitly parameterized
by W.

[C] is the way to distinguish reasoning with causal links (CM is used) from
reasoning with mere logical implications (as given in W). Technically, $\nvdash_1 \sigma \leftrightarrow [C]\sigma$.

Direct causes are now defined as hypotheses that imply the formula according 
to causal links while not contradicting it. Hence the definition: 

Definition 1. Let CM be a causal model. An atom-conjunction $\delta$ is a direct
cause for a formula $\gamma$ iff $\delta$ satisfies both conditions below:

- $CM, [C]\delta \vdash_1 [E]\gamma$.

- $CM, [C]\delta, [E]\gamma \nvdash_1 [C]\bot$.

$\gamma$ is said to be a (direct) effect of $\delta$. When the second condition is satisfied, $\delta$ is
a consistent hypothesis for $\gamma$.

Axiom (0) means that if each observation in a series is given an explanation, 
then the conjunction of these explanations is an explanation for the alternative 
consisting of exactly these observations. 

Rule (1) turns, if needed, an effect into an intermediate cause yielding further 
effects. That is, a chain of causal implications can be applied in order for end 
effects to be inferred. 

Example 1. We consider the aforementioned application: 

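$CM = \{\, [C]\text{sof} \rightarrow [E]\text{step},\ [C]\text{evolution} \rightarrow [E]\text{alarm},\ [C]\text{swc} \rightarrow [E]\text{red-light} \,\}$

$W = \{\, \text{step} \rightarrow \text{evolution},\ \text{slow-increase} \rightarrow \text{evolution},\ \text{sharp-step} \rightarrow \text{step} \,\}$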
sof is a direct cause of step and alarm (and then, step and alarm are direct
effects of sof):

$[C]\text{sof},\ [C]\text{sof} \rightarrow [E]\text{step} \vdash_1 [E]\text{step}$.

$[E]\text{step},\ \text{step} \rightarrow \text{evolution} \vdash_1 [C]\text{evolution}$.

$[C]\text{evolution},\ [C]\text{evolution} \rightarrow [E]\text{alarm} \vdash_1 [E]\text{alarm}$.

Moreover, sof $\wedge$ swc is a direct cause of step $\wedge$ red-light. By (0), sof $\wedge$ swc is a
direct cause for step $\vee$ red-light.

As shown above, step and alarm are direct effects of sof. By contrast,
evolution is not a direct effect of sof: $[C]\text{sof} \vdash_1 [C]\text{evolution}$ but
$[C]\text{sof} \nvdash_1 [E]\text{evolution}$. This is because evolution is not precise enough. The
causal model provides no reason to consider sof as causing any kind of evolution
unless it is a step. This is to say that evolution in general cannot be explained
by sof. However, evolution takes part in the causal chain from sof to alarm. Let us
assume for a moment that the above example is encoded in the usual way:
sof $\rightarrow$ step. Then, sof would imply evolution (let alone really undesirable
formulas such as step $\vee$ anything). So, $\vdash_1$ is “more” than implying.
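The derivation in Example 1 can be replayed mechanically. The following is a minimal sketch, not the authors' procedure: it assumes that all formulas are atoms, that W contains only atom-to-atom implications, and it leaves out the consistency condition of Definition 1. The function names (`w_closure`, `direct_effects`) are hypothetical.

```python
def w_closure(atom, w):
    """Atoms classically implied by `atom` under the atom-to-atom rules of W."""
    seen, stack = {atom}, [atom]
    while stack:
        current = stack.pop()
        for implied in w.get(current, ()):
            if implied not in seen:
                seen.add(implied)
                stack.append(implied)
    return seen


def direct_effects(hypothesis, cm, w):
    """Return (effects, causes): atoms e with [E]e and atoms c with [C]c
    derivable from the hypothesis in the atomic fragment sketched here."""
    causes, effects = set(hypothesis), set()
    changed = True
    while changed:
        changed = False
        # Causal links: [C]c1 & ... & [C]cn -> [E]e
        for body, head in cm:
            if body <= causes and head not in effects:
                effects.add(head)
                changed = True
        # Rule (1): from [E]a and (W & a) -> b, infer [C]b
        for effect in list(effects):
            new_causes = w_closure(effect, w) - causes
            if new_causes:
                causes |= new_causes
                changed = True
    return effects, causes


# Example 1 of the text
CM = [(frozenset({"sof"}), "step"),
      (frozenset({"evolution"}), "alarm"),
      (frozenset({"swc"}), "red-light")]
W = {"step": {"evolution"},
     "slow-increase": {"evolution"},
     "sharp-step": {"step"}}

effects, causes = direct_effects({"sof"}, CM, W)
print(sorted(effects))          # ['alarm', 'step']
print("evolution" in causes)    # True: [C]evolution is derived
print("evolution" in effects)   # False: [E]evolution is not
```

Keeping the [C]-set and the [E]-set apart is what makes evolution an intermediate cause without making it a direct effect.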

Let us now highlight the following examples, in which the direct causes just
defined are not satisfactory as explanations and which motivate the need for the
inference system developed in section 4.

Example 2. $W = \emptyset$

$CM = \{\, [C]\text{flu} \rightarrow [E](\text{temp} = 38 \vee \cdots \vee \text{temp} = 42) \,\}$

temp = 40 fails to be a direct effect of flu although
temp = 38 $\vee \cdots \vee$ temp = 42 is a direct effect of flu.

Example 3. $W = \{\text{sharp-step} \rightarrow \text{step}\}$

$CM = \{\, [C]\text{sof} \rightarrow [E]\text{step} \,\}$

Again, sharp-step is not a direct effect of sof although step is. Yet, we know
by W that sharp-step is a kind of step. It then seems reasonable to consider
that sof is a causal explanation for sharp-step, given that sharp-step is one of
the alternatives in a direct effect of sof.

That is, we need to identify “possible effects” of direct causes so that direct
causes are viewed as explanations of such effects. We do this in the following
section, in which we introduce the inference system $\vdash_2$ to define “explaining”. In
some sense, $\vdash_2$ will be at the same time “more” than implying and “less” than
implying.

4 Explanations 

The inference system $\vdash_2$ extends $\vdash_1$ with:

$$\frac{[E]\alpha \qquad (W \wedge \beta) \rightarrow \alpha}{[E]\beta} \quad (2)$$

where $\alpha$ and $\beta$ are atom-disjunctions such that $W \nvdash \alpha$ and $W \nvdash \neg\beta$.

Rule (2) makes it possible for any of the alternatives described by a direct effect
to be explained. In symbols, $\vdash_2 [E](\alpha_1 \vee \cdots \vee \alpha_n) \rightarrow [E]\alpha_i$ for $i = 1, \ldots, n$.

In order to be explained, a formula should be deduced according to $\vdash_2$ and
the definition of an explanation is:

Definition 2. Let CM be a causal model. An atom-conjunction $\delta$ is an
explanation for a formula $\gamma$ iff

- CM, $[C]\delta \vdash_2 [E]\gamma$.

- CM, $[C]\delta$, $[E]\gamma \nvdash_1 [C]\bot$.

The consistency condition is still stated using $\vdash_1$ because having
CM, $[C]\delta \vdash_2 [E]\bot$ is quite normal: A cause can have two possible effects which
contradict each other (for example, flu and its possible effects on the temperature).
In symbols, flu $\vdash_2$ [E]temp = 38 $\wedge$ [E]temp = 39 but
flu, [E](temp = 38 $\wedge$ temp = 39) $\vdash_1 [C]\bot$ (each of temp = 38 and temp = 39 is a
possible effect of flu but flu cannot be an explanation for temp = 38 $\wedge$ temp = 39
because temp = 38 and temp = 39 contradict each other).

Observation. All direct causes are explanations. The converse is untrue.

Example 4. Let us return to: $W = \{\text{sharp-step} \rightarrow \text{step}\}$

$CM = \{\, [C]\text{sof} \rightarrow [E]\text{step} \,\}$

sof is an explanation for step. W means that sharp-step is one of the alternative
ways a step can manifest itself. So, sof is an explanation for sharp-step.

$[C]\text{sof},\ [C]\text{sof} \rightarrow [E]\text{step} \vdash_2 [E]\text{step}$.

$[E]\text{step},\ \text{sharp-step} \rightarrow \text{step} \vdash_2 [E]\text{sharp-step}$.

Similarly, in the flu example, flu is an explanation for temp = 38 (as well
as temp = 39, ...):
[C]flu, [C]flu $\rightarrow$ [E](temp = 38 $\vee \cdots \vee$ temp = 42) $\vdash_2$ [E]temp = 38.

Let us assume for a moment that the flu example is encoded in the usual way:

flu $\rightarrow$ (temp = 38 $\vee \cdots \vee$ temp = 42).

Then, flu would not imply temp = 38. So, $\vdash_2$ is “less” than implying.
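To make the effect of rule (2) concrete, here is a minimal sketch under simplifying assumptions that are not in the paper: formulas are atoms, an effect is a list of alternative atoms (read as a disjunction), W holds only atom-to-atom implications (so the side conditions of rule (2) hold trivially), and the consistency condition of Definition 2 is not checked. The name `explained_atoms` is hypothetical.

```python
def w_closure(atom, w):
    """Atoms classically implied by `atom` under the atom-to-atom rules of W."""
    seen, stack = {atom}, [atom]
    while stack:
        current = stack.pop()
        for implied in w.get(current, ()):
            if implied not in seen:
                seen.add(implied)
                stack.append(implied)
    return seen


def explained_atoms(hypothesis, cm, w):
    """Atoms b such that [E]b is derivable from the hypothesis with
    causal links, rule (1) and rule (2), in the atomic fragment."""
    atoms = set(hypothesis) | set(w)
    for implied in w.values():
        atoms |= set(implied)
    for body, alternatives in cm:
        atoms |= set(body) | set(alternatives)

    causes, effects = set(hypothesis), set()
    changed = True
    while changed:
        changed = False
        for body, alternatives in cm:
            if body <= causes:
                # [E](a1 v ... v an) yields [E]ai for each i (rule (2))
                new_effects = set(alternatives) - effects
                if new_effects:
                    effects |= new_effects
                    changed = True
        # Rule (2) through W: from [E]a and (W & b) -> a, infer [E]b
        for candidate in atoms:
            if candidate not in effects and w_closure(candidate, w) & effects:
                effects.add(candidate)
                changed = True
        # Rule (1): from [E]a and (W & a) -> b, infer [C]b
        for effect in list(effects):
            new_causes = w_closure(effect, w) - causes
            if new_causes:
                causes |= new_causes
                changed = True
    return effects


# Example 4: sharp-step is a possible effect of sof
cm_sof = [(frozenset({"sof"}), ["step"])]
w_sof = {"sharp-step": {"step"}}
print(sorted(explained_atoms({"sof"}, cm_sof, w_sof)))  # ['sharp-step', 'step']

# Flu example: every alternative temperature is explained
temps = ["temp=%d" % t for t in range(38, 43)]
cm_flu = [(frozenset({"flu"}), temps)]
print(sorted(explained_atoms({"flu"}, cm_flu, {})) == sorted(temps))  # True
```

The sketch only collects atoms; whether a conjunction of them is a consistent observation still has to be checked separately, as the consistency condition requires.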

Sometimes, observations are not informative enough to be explained. A typ- 
ical case arises when the observation consists of a range of values, and the range 
is just too broad (in the flu example, consider having “temperature is between 
37 and 40, inclusive” as your observation). It is often interesting and useful to 
indicate whether the observation can make sense at all: There should be a way 
to detect that some explanation exists, on condition that the observation can 
be refined, at least in theory, so as to match the expectation (in the flu exam- 
ple, “temperature is between 37 and 40, inclusive” can be explained inasmuch 
as flu causes temperature to be 38 for instance). Explanations providing such 
indications are called conditional. They are defined as follows. 
Definition 3. Given a causal model CM, an atom-conjunction $\delta$ is a
conditional explanation for a formula $\gamma$ iff $\delta$ is a consistent hypothesis for $\gamma$
such that

CM, $[C]\delta \vdash_2 [C]\gamma$.

Example 5. $W = \emptyset$

$CM = \{\, [C]\text{sof} \rightarrow [E](\exists x\, \text{step}(x) \wedge x > 25) \,\}$

$\exists x\, \text{step}(x) \wedge x < 30$ has no explanation, which is quite normal. In fact [C]sof
implies $[E](\exists x\, \text{step}(x) \wedge x > 25)$ but this is not enough: In order to infer
$[E](\exists x\, \text{step}(x) \wedge x < 30)$, W should allow us to conclude $\exists x\, \text{step}(x) \wedge x > 25$
from $\exists x\, \text{step}(x) \wedge x < 30$, which obviously fails to be the case. For the same
reason, $\exists x\, \text{step}(x) \wedge x > 20$ has no explanation. However, we do have:
$[C]\text{sof} \vdash_2 [C](\exists x\, \text{step}(x) \wedge x < 30)$ and $[C]\text{sof} \vdash_2 [C](\exists x\, \text{step}(x) \wedge x > 20)$.

I.e., sof is a conditional explanation for $\exists x\, \text{step}(x) \wedge x < 30$ and for
$\exists x\, \text{step}(x) \wedge x > 20$. In fact, both observations have a non-empty intersection
with what is expected, the first one overlapping the set of expected values and
the second one containing it. The observation can be explained on the condition
that, if refined, it would turn out to fall within the range of expected outcomes.
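The distinction drawn in Example 5 can be illustrated on numeric ranges. This is only a sketch under assumptions of ours: the expected effect and the observation are each an open interval over a single attribute, "explained" is read as "the observation implies the expected effect" (interval inclusion), and "conditionally explained" as "the two ranges intersect". The function names are hypothetical.

```python
import math


def is_subset(observed, expected):
    """True if every value allowed by the observation is an expected value."""
    return expected[0] <= observed[0] and observed[1] <= expected[1]


def overlaps(observed, expected):
    """True if the open intervals share at least one value."""
    return max(observed[0], expected[0]) < min(observed[1], expected[1])


def classify(observed, expected):
    if is_subset(observed, expected):
        return "explained"
    if overlaps(observed, expected):
        return "conditionally explained"
    return "not explained"


expected = (25, math.inf)                    # sof causes a step with x > 25

print(classify((-math.inf, 30), expected))   # conditionally explained (x < 30)
print(classify((20, math.inf), expected))    # conditionally explained (x > 20)
print(classify((26, 28), expected))          # explained
print(classify((-math.inf, 20), expected))   # not explained
```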

5 Application to diagnostics 

Our first motivation when looking for a more adequate definition of explaining 
was the diagnostics problem. In this section, we investigate how the definitions 
(section 4) we propose for explaining can be applied to diagnostics. 

The aim of diagnostics reasoning is to find an explanation for an observed ab- 
normal system behaviour. Causal models have been used widely in the diagnosis 
field since the initial research of [18, 6]. The behaviour of a system is represented 
by describing the causal relations between states of the system, especially be- 
tween faults and their manifestations. The diagnoses are defined as explanations 
of the observed facts according to the causal model. 

Let us briefly review the usual logical approach to diagnostics. Causal
information “X causes Y” is encoded as an implication $X \rightarrow Y$, the causal model
CM consisting of these formulas. Contextual information (which is not to be
explained but is used to explain observations) is encoded by means of classical
formulas in a theory CXT. When needed, CXT is extended to a theory W
involving background knowledge (definitions, exclusions, and/or other non-causal
relations [13]) in the form of classical formulas. Observations are represented
by a set of classical formulas OBS.

The diagnoses are defined as explanations of the observed facts according 
to the causal model. In the logical language, there are predefined predicates 
corresponding to observables and to abducibles (to use the terminology of [7]). 
Classically, a diagnosis is restricted to be a conjunction of abducibles, an
abd-conjunction. An observation, written Obs, which often corresponds to a sensor
value, is described by a disjunction of observables, an obs-disjunction. The set of
observed facts, OBS, is described by an obs-formula.
Definition 4. A diagnostics problem is a triple (CM, W, OBS) where CM is
the causal model, W the background knowledge, and OBS a set of observations.

The basic definition of abductive diagnosis ([7]) is the following:

Definition 5. A diagnosis for the diagnostics problem (CM, W, OBS) is an
abd-conjunction $\Delta$ such that:

- CM, W, OBS, $\Delta \nvdash \bot$ (consistency)

- $\forall\, Obs_i \in OBS$: CM, W, $\Delta \vdash Obs_i$

The second condition clearly encodes “$\Delta$ explains OBS” as “$\Delta$ implies
OBS”; that is the source of the problems highlighted in [9] and illustrated by
means of the following example.



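Example 6. $W = \{\exists x\, \text{step}(x) \rightarrow \text{evolution}\}$

$CM = \{\, \Delta_1 \rightarrow \exists x\, \text{step}(x) \vee \text{peak},$
$\quad \Delta_2 \rightarrow \exists x\, \text{step}(x),$
$\quad \Delta_3 \rightarrow \exists x\, \text{step}(x) \wedge x > 25,$
$\quad \Delta_4 \rightarrow \text{evolution},$
$\quad \Delta_4 \rightarrow \text{peak},$
$\quad \Delta_4 \rightarrow \text{vibration} \,\}$

where evolution, step, peak and vibration are obs-predicates and $\Delta_1$, $\Delta_2$, $\Delta_3$, $\Delta_4$
are abd-predicates.

Let OBS = {step(10)}. None of $\Delta_1, \ldots, \Delta_4$ implies step(10); no diagnosis can
be found.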
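The failure of the classical definition on this example can be checked by brute force. The sketch below is a propositional abstraction of Example 6 and rests on assumptions of ours: the atom `step` stands for "there is a step", `step10` for step(10), `big_step` for "there is a step with x > 25", and the two implications `step10 -> step` and `big_step -> step` are added to W to link them. All helper names are hypothetical.

```python
from itertools import product

ATOMS = ["d1", "d2", "d3", "d4", "step", "step10", "big_step",
         "evolution", "peak", "vibration"]


def atom(name):
    return lambda m: m[name]


def implies(p, q):
    return lambda m: (not p(m)) or q(m)


def either(p, q):
    return lambda m: p(m) or q(m)


CM = [implies(atom("d1"), either(atom("step"), atom("peak"))),
      implies(atom("d2"), atom("step")),
      implies(atom("d3"), atom("big_step")),
      implies(atom("d4"), atom("evolution")),
      implies(atom("d4"), atom("peak")),
      implies(atom("d4"), atom("vibration"))]

W = [implies(atom("step"), atom("evolution")),
     implies(atom("step10"), atom("step")),     # assumption of this sketch
     implies(atom("big_step"), atom("step"))]   # assumption of this sketch


def models(theory):
    """Enumerate all truth assignments satisfying every formula of the theory."""
    for values in product([False, True], repeat=len(ATOMS)):
        m = dict(zip(ATOMS, values))
        if all(formula(m) for formula in theory):
            yield m


def consistent(theory):
    return next(models(theory), None) is not None


def entails(theory, formula):
    return all(formula(m) for m in models(theory))


def classical_diagnoses(observations):
    """Single abducibles satisfying both conditions of Definition 5."""
    found = []
    for name in ["d1", "d2", "d3", "d4"]:
        theory = CM + W + [atom(name)]
        if consistent(theory + observations) and \
                all(entails(theory, obs) for obs in observations):
            found.append(name)
    return found


print(classical_diagnoses([atom("step10")]))  # []
print(classical_diagnoses([atom("peak")]))    # ['d4']
```

Every abducible is consistent with step(10), yet none entails it, which is exactly why "explaining as implying" returns no diagnosis here.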

We propose to use our definition for explaining (section 4) in order to show
its applicability to the field of diagnostics. The causal links of CM are encoded
as $[C]X \rightarrow [E]Y$ while W consists of non-modal formulas and we introduce OBS
as a conjunction of non-modal formulas.



Definition 6. A diagnosis for the diagnostics problem (CM, W, OBS) is an
abd-conjunction $\Delta$ such that $\Delta$ is an explanation for OBS, i.e., according to the
definition given in section 4, $\Delta$ is a consistent hypothesis for OBS such that
CM, $\Delta \vdash_2$ OBS.

What about the example? For OBS = step(10), $\Delta_1$, $\Delta_2$ and $\Delta_4$ are diagnoses
(and so are $\Delta_1 \wedge \Delta_2$, $\Delta_1 \wedge \Delta_4$, ...). Given OBS = step(10) $\wedge$ (peak $\vee$ vibration),
$\Delta_2$ is no longer a diagnosis but $\Delta_1$ and $\Delta_4$ are still diagnoses. Given OBS =
step(10) $\wedge$ peak $\wedge$ vibration, $\Delta_4$ is the only diagnosis for OBS (with $\Delta_4 \wedge \Delta_1$,
$\Delta_4 \wedge \Delta_2$, ...).

This definition of abductive diagnoses can be extended, if needed, to condi- 
tional diagnoses using the definition of conditional explanations (section 4). 

* An extension of the above definition was proposed in [8] where observations are 
divided between observations the diagnosis has to explain and observations it has 
to be consistent with. It is not explained in detail here because it does not really 
change the point we want to make. 
6 Related work

The thrust of this paper is to question the usual definition of “explaining” as 
“implying” . We propose a formal framework allowing us to define “explaining” 
in a way closer to a common-sense view and corresponding at the same time 
to a weak form and a strong form of implication. We distinguish syntactically 
between the causal part of the model, describing what is physically at the source 
of observables, and background knowledge, which is descriptive knowledge of the 
domain: definitions, exclusions, hierarchical relations. 

The problem of finding explanations for observations which are too specific 
w.r.t. what is expected according to a given model is clearly stated by Poole [15]. 
The principle of parameterizing hypotheses is proposed as a solution. Knowing 
that a battery which is run down causes the voltage to be less than 1.2V, the 
causal link $\forall B\ \text{flat}(B) \rightarrow \exists V\ \text{voltage}(B, V) \wedge V < 1.2$ is transformed into
$\forall B, V\ \text{flat}(B, V) \rightarrow \text{voltage}(B, V) \wedge V < 1.2$. Due to such parameters, any

consistent and precise voltage can be implied. However, as many parameters as 
needed must be added. This solution, which is obviously tricky, rapidly becomes 
inadequate, especially when symptoms are multivalued: Imagine what happens 
when one wishes to state that, due to a fault, it is possible for many steps to 
occur in the next 10 minutes. Parameterizing hypotheses becomes quite difficult. 

Kautz [10], in the framework of plan recognition, relies on an event hierarchy 
(abstraction and decomposition links) and considers all the “uses” of an observed 
fact before explaining it by abduction (the “use” relation being roughly the 
inverse of the component relation). 

Preist [17] uses default logic to handle causal models containing disjunction. 
His idea is to translate a causal link with disjunctive consequence into as many 
default rules as necessary. This approach relies on a specific non-monotonic for- 
malism. It shares similarities with our proposal in that it amounts to weakening 
classical implication when defining abduction. Yet, he does not directly propose 
a new way of defining explanations and he runs into problems similar to Poole’s. 

Many authors in model-based diagnostics have been faced with this problem
when dealing with temporal information ([13, 3, 16]). The main difficulty with
temporal constraints is that most observations are temporally more accurate 
than is predicted. For example, you expect a step on the observed signal in the 
next 10 minutes and you observe it exactly 5 minutes later. That is exactly 
the kind of problem which motivates our work and which is not, as we show, 
specific to temporal information. To tackle this problem, [3, 13] propose to make 
a distinction between the logical part and the temporal part of an observation, 
requiring a diagnosis to imply the logical part but allowing it to be simply 
consistent with the temporal part. It is clear that only checking the consistency 
of temporal constraints yields in general less restrictive diagnoses than those 
obtained with our definition (see [9] for more details) and that these additional 
diagnoses are not interesting. Our proposal comes to grips in a more satisfying 
way with the specific case of temporal diagnoses while at the same time answering 
more generally a problem that arises whenever the language used to express the 
model and the observations allows for disjunctive forms. 
[16] describes an application of temporal abductive diagnoses to nuclear
power station maintenance. A causal model, involving temporal as well as
arithmetical constraints, is used to find all the faults explaining the observations. Our
work was originally motivated by this application and its illustration of how
classical definitions were inadequate for such real applications. To tackle this
problem, [16] suggests adding an observation model which achieves a tailor-made
correspondence by adding symptom $\rightarrow$ observation relations, which have
no logical justification. The definition we propose responds to this problem and
allows for the use of additional knowledge.

[20] proposes a framework for diagnostics systems extending the spectrum
of definitions due to [8]. A diagnosis is defined using six parameters, one being
the cover relation, noted $\vdash_{cov}$. Our definition of “explanation” can be seen as a
specific cover relation and our diagnostics definition as an instantiation of their
general scheme.

Our proposal, by dealing with explanation in general instead of focussing on 
diagnostics, extends [9]. Moreover, it makes a major improvement by allowing 
possible effects to reenter the causal chain. Let us illustrate this point by looking 
at the following example: 

$CM = \{\, [C]\text{sof} \rightarrow [E]\text{step},\ [C]\text{sharp-step} \rightarrow [E]\text{alarm} \,\}$

$W = \{\text{sharp-step} \rightarrow \text{step}\}$.

Not only is sharp-step explained as being a possible effect of sof (by rule (2)) 
but it reenters the causal chain (rule ( 1 )) and alarm is consequently explained 
by sof. This important point is not dealt with in [9] which relies on the relation 
existing between observations and what is defined as “symptoms”. In this case, 
the only “symptom” for sof is step, sharp-step is rightly considered as explained 
by being more precise than the symptom but it does not reenter as a possible 
cause, which prevents alarm from being explained by sof. Another point is that 
we pay attention to the syntactical form of the causal model CM. Let $C_1$: $\alpha$
causes $\beta$ and $C_2$: $\alpha$ causes $\beta \vee \gamma$ be two causal links in CM. In our current
proposal, these two links are considered as being two distinct effects of $\alpha$ (two
physically distinct ways of acting): the first one causes $\beta$ and the second one
causes, nondeterministically, $\beta$ or $\gamma$. Such is not the case in [9] where the causal
model is considered as a set of logical formulas and $C_2$ as redundant w.r.t. $C_1$.

In the field of abductive reasoning, some authors discuss the problem of 
preferring one explanation over another. For example, [5] are concerned with ab- 
ductive reasoning and abstraction. Defining abduction as usual through classical 
implication, they do not address the problem of explaining overly-specific and 
overly-general observations. They provide a way, related to Poole’s definition of 
less presumptive explanations [14], of preferring explanations by taking
abstraction links into account. [4, 12, 1] study non-monotonic logics for explanations, as
does [11] in still another way. [19] compare the notion of generality as a criterion 
for ordering explanations both in abduction and in induction. 
7 Perspectives and conclusion

Our approach can be extended in several respects. Just consider that causal 
information can be uncertain. An example is any statement expressing that 
a certain phenomenon possibly happens in a specific situation: “X can cause
Y” (meaning that X should not be ruled out when an explanation for Y is
sought). In such a case, a usual solution amounts to introducing a so-called 
abstract condition alpha [7]. X is conjoined with alpha so as to form a causal link 
“X and alpha cause Y” . As a consequence, any explanation involving alpha is 
distinguished from presumably more dependable explanations. Uncertainty also 
occurs in statements of the form “X or Y cause Z” when this does not mean that 
X cause Z and Y cause Z (expressed by two causal links) but may rather mean 
that either X causes Z or Y causes Z although it is not known which is the actual 
cause of Z. Taking care that [C] has adequate properties and does not distribute
over $\vee$, we can express that one of several causal links is the case. This goes
beyond the expressiveness of causal graphs (where there is no way to state that 
one of several purported edges is indeed an edge) . More generally, tinkering with 
the interaction between the modal operators and disjunction makes it possible to 
handle disjunctive information wherever it occurs: causes, effects, observations. 

Our contribution is proposing a new definition for causal explanations. We 
depart from the usual formulation of abduction in logic: For us, explaining does 
not amount to going backwards through logical implication. Our approach con- 
sists in separating causal implication from mere logical implication by using 
modal operators and in introducing two inference schemes so that inferring an
explanation is in some way “less than implying” and “more than implying” . We 
argue that this definition is closer to the common-sense view of explaining. In- 
deed, consider observations that are more precise than the range of outcomes 
given in the causal links: We can both explain such observations and take account 
of those observations which are not precise enough, which get no explanation. 
Finally, we have shown how our approach nicely applies to diagnostics. 
Abstract: In this paper we consider the problem of inducing causal relations from
statistical data. Although it is well known that a correlation does not justify the claim of 
a causal relation between two measures, the question seems not to be settled. Research 
in the field of Bayesian networks revived an approach suggested in [16]. It is based 
on the idea that there are relationships between the causal structure of a domain and 
its corresponding probability distribution, which could be exploited to infer at least 
part of the causal structure from a set of dependence and independence statements. 
This idea was developed into the inductive causation algorithm [14]. We review this 
algorithm and examine the assumptions underlying it. 



1 Introduction 

If A causes B, an occurrence of A should be accompanied or (closely) followed 
by an occurrence of B. That causation implies conjunction is the basis of all 
reasoning about causation in statistics. But is this enough to infer causal relations 
from statistical data, and, if not, are there additional assumptions that provide 
reasonable grounds for such inference? These are the questions we discuss here. 

An appropriate framework for such a discussion is the theory of Bayesian 
networks. Research in this field is influenced from two directions. In the first 
place, Bayesian networks are studied on purely statistical grounds as one of 
several approaches to make reasoning in multi-dimensional domains feasible by 
decomposing the uncertainty information available about the domain [9] . Among 
such approaches, Bayesian networks [15] and Markov networks [11] are the best 
known probabilistic methods. Others include the more general valuation-based 
networks [19] and possibilistic networks [10,3]. 

Secondly, Bayesian networks are studied as descriptions of a structure of 
causal influences. Since they use conditional probability distributions which pos- 
sess an inherent direction, the idea suggests itself to “direct” the distributions 
in such a way that they represent the causal influences. Indeed, human experts 
often start from a causal model of the underlying domain and choose the condi- 
tional probability distributions of the Bayesian network accordingly. 

Therefore, in Bayesian networks, statistics and causal modeling are conjoined.
This is emphasized by the d-separation criterion [15,4], which allows us to read
the probabilistic dependences and independences from the causal structure
underlying a Bayesian network. Subsequently an algorithm, the so-called inductive
causation algorithm [14], was suggested to invert this procedure and to infer at
least part of the causal structure from observed dependences and independences.
This algorithm and its assumptions form the core of our discussion.

In section 2 we consider the connection of correlation and causation in gen- 
eral. Since a single correlation is not enough to infer a causal relation, we turn 
to the probabilistic and the causal structure of several variables in section 3. 
In section 4 we state the d-separation criterion and the stability assumption, 
which connect the causal to the probabilistic structure. In section 5 we review 
the inductive causation algorithm and, in section 6, discuss the assumptions 
underlying it. Finally, in section 7, we draw conclusions from our discussion. 



2 Correlation and Causation 

Correlation is perhaps the most frequently used concept in applied statistics. 
Its standard measure is the correlation coefficient, which assesses what can be 
called the intensity of linear relationship between two measures. Correlation is 
closely related to probabilistic dependence. However, the two concepts are not 
identical, because zero correlation does not imply independence. But since this 
difference is of no importance for our discussion, we use the term “correlation” 
in the vernacular sense, i.e., as a synonym for (probabilistic) dependence. 
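That zero correlation does not imply independence can be seen on a standard textbook example, which is ours and not taken from the paper: let X be uniform on {-1, 0, 1} and let Y = X². The sketch below computes the covariance exactly and compares a joint probability with the product of its marginals.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1}; Y is fully determined by X as Y = X * X.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}


def expect(f):
    """Expected value of f(X, Y) under the joint distribution."""
    return sum(prob * f(x, y) for (x, y), prob in joint.items())


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
covariance = expect(lambda x, y: (x - mean_x) * (y - mean_y))

p_x0 = sum(prob for (x, y), prob in joint.items() if x == 0)
p_y0 = sum(prob for (x, y), prob in joint.items() if y == 0)
p_x0_y0 = joint.get((0, 0), Fraction(0))

print(covariance)             # 0
print(p_x0_y0, p_x0 * p_y0)   # 1/3 1/9
```

The covariance (and hence the correlation coefficient) is zero, yet P(X = 0, Y = 0) differs from P(X = 0) P(Y = 0), so X and Y are dependent.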

Note that neither in the narrower statistical nor in the wider vernacular sense
is correlation connected directly to a causal relation. We usually do not know why
a correlation exists or does not exist, only that it is present or not. Nevertheless
a causal interpretation, erroneous as it is, is tempting [5]:

Much of the fascination of statistics lies embedded in a gut feeling — and never 
trust a gut feeling — that abstract measures summarizing large tables of data 
must express something more real and fundamental than the data itself. (Much 
professional training in statistics involves a conscious effort to counteract this gut 
feeling.) The technique of correlation has been particularly subject to such misuse 
because it seems to provide a path for inferences about causality. [...] [But t]he 
inference of cause must come from somewhere else, not from the simple fact of 
correlation — though an unexpected correlation may lead us to search for causes 
so long as we remember that we may not find them. [...] The invalid assumption 
that correlation implies cause is probably among the two or three most serious and 
common errors of human reasoning. 

It is easily demonstrated that indeed the vast majority of all correlations are, 
without doubt, noncausal. Consider, for example, the distance between the con- 
tinents America and Europe over the past twenty years (or any other suitable 
period). Due to continental drift this distance increases a few centimeters ev- 
ery year. Consider also the average price of Swiss cheese in the United States 
over the same period.^ The correlation coefficient of these two measures is close 
to 1, i.e., even in the narrow statistical sense they are strongly correlated. But 
obviously there is no causal relation whatsoever between them. 

^ We do not know much about the average price of Swiss cheese in the United States 
over the past twenty years, but we assume that it has risen. If it has not, substitute 
the price of any other consumer good that has. 
Of course, we could also have used many other measures that increased
over the past years, for example, the distance of Halley’s comet (since its last
visit in 1986) or the reader’s age. The same can be achieved with measures that
decreased over the past years. Therefore, causality may neither be inferred from
correlation with certainty (since there are counterexamples), nor even inferred
with a high probability (since causal correlations themselves are fairly rare).

According to these arguments it seems to be a futile effort to try to infer
causation from observed dependences. Indeed, there is no way to get to causation
from a single correlation (i.e., a dependence between two variables). But this
does not immediately exclude the possibility of inferring, from a set of (conditional)
dependences and independences between several variables, something about the
underlying causal influences. There could be connections between the causal and
the probabilistic structure, which enable us to discover the former at least partly.

3 Probabilistic and Causal Structure 

From the point of view of statistics the basic idea underlying Bayesian networks
is that a probability distribution P on a multi-dimensional domain can, under
certain conditions, be decomposed into a set $\{P_1, \ldots, P_n\}$ of (conditional)
distributions on lower-dimensional subspaces. Such a decomposition rests on two
things: the chain rule of probability and a set of (conditional) independence
statements. Let $U = \{X_1, \ldots, X_n\}$ be a set of discrete random variables. Then
the chain rule of probability states that
$\forall x_1 \in \mathrm{dom}(X_1), \ldots, \forall x_n \in \mathrm{dom}(X_n)$:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1}),$$

where $P(x_1, \ldots, x_n)$ is short for $P(X_1 = x_1, \ldots, X_n = x_n)$, etc. If a set of
conditional independence statements is given, this factorization can sometimes be
significantly simplified. Thus one arrives at
$\forall x_1 \in \mathrm{dom}(X_1), \ldots, \forall x_n \in \mathrm{dom}(X_n)$:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \pi_{\mathrm{par}(X_i)}(x_1, \ldots, x_{i-1})),$$

where $\pi_S(I)$ denotes the projection of an instantiation $I$ of a set of random
variables to the variables in $S$ and $\mathrm{par}(X_i) \subseteq \{X_1, \ldots, X_{i-1}\}$ is chosen in
such a way that $X_i \perp\!\!\!\perp \{X_1, \ldots, X_{i-1}\} \setminus \mathrm{par}(X_i) \mid \mathrm{par}(X_i)$, i.e., that
$\forall x_1 \in \mathrm{dom}(X_1), \ldots, \forall x_i \in \mathrm{dom}(X_i)$:
$P(x_i \mid \pi_{\mathrm{par}(X_i)}(x_1, \ldots, x_{i-1})) = P(x_i \mid x_1, \ldots, x_{i-1})$.
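Both factorizations can be checked numerically on a small example. The following sketch uses three binary variables arranged as a chain (the first influences the second, the second influences the third); the probability tables are invented purely for illustration, and the helper names are hypothetical. Exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction as F
from itertools import product

# Invented conditional tables for a chain X0 -> X1 -> X2 (illustration only).
p_x0 = {0: F(7, 10), 1: F(3, 10)}
p_x1_given_x0 = {0: {0: F(9, 10), 1: F(1, 10)}, 1: {0: F(1, 5), 1: F(4, 5)}}
p_x2_given_x1 = {0: {0: F(3, 4), 1: F(1, 4)}, 1: {0: F(1, 5), 1: F(4, 5)}}

joint = {(a, b, c): p_x0[a] * p_x1_given_x0[a][b] * p_x2_given_x1[b][c]
         for a, b, c in product((0, 1), repeat=3)}


def marginal(assignment):
    """Probability of a partial assignment {variable index: value}."""
    return sum(p for x, p in joint.items()
               if all(x[i] == v for i, v in assignment.items()))


def conditional(i, x, given):
    """P(X_i = x[i] | X_j = x[j] for all j in given), computed from the joint."""
    condition = {j: x[j] for j in given}
    return marginal({**condition, i: x[i]}) / marginal(condition)


def chain_rule(x):
    """Product of P(x_i | x_0, ..., x_{i-1}) over all i."""
    result = F(1)
    for i in range(len(x)):
        result *= conditional(i, x, range(i))
    return result


PARENTS = {0: [], 1: [0], 2: [1]}   # par(X_i) for the chain


def parent_factorization(x):
    """Product of P(x_i | projection of x onto par(X_i)) over all i."""
    result = F(1)
    for i in range(len(x)):
        result *= conditional(i, x, PARENTS[i])
    return result


print(all(joint[x] == chain_rule(x) for x in joint))            # True
print(all(joint[x] == parent_factorization(x) for x in joint))  # True
```

The second check succeeds only because the third variable is conditionally independent of the first given the second in this particular joint distribution; for an arbitrary joint, only the chain rule is guaranteed.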

Such a factorization is usually represented by a directed acyclic hypergraph,
in which each node represents a random variable and each hyperedge represents
a conditional probability distribution. We need hyperedges, which connect more
than two nodes, since in general a variable is conditioned on more than one
other variable. But since each node can have at most one hyperedge leading to
it, one may also use a normal directed graph. In this case all parent nodes of
a given node are in the condition part of the distribution for that node. This
directed (hyper)graph we call a probabilistic structure $\mathcal{P}$. It is obvious that it is
not unique, since it depends on the ordering of the variables.
Fig. 1. Causal (left) and probabilistic structure (right) of the lawn example



We now turn to the causal structure of a domain. Our intuition of causation
is perhaps best captured by a binary predicate “X (directly) causes Y” or “X
has a (direct) causal influence on Y”, where X is the cause and Y the effect. This
predicate is usually seen as antisymmetric, i.e., if “X (directly) causes Y” holds,
then “Y (directly) causes X” does not hold. Thus there is an inherent direction
in causal influence, which seems to be a characteristic property of causation. For
the most part it is due to our intuition that a cause precedes its effect in time.

Another formal interpretation is that an effect is a function of its cause.
But we reject this interpretation for several reasons. The first is that it brings
in an assumption through the back door, which we want to make explicit (see
section 6). Secondly, a function is not necessarily antisymmetric and thus cannot
always represent the direction of causation. Thirdly, if one variable is a function
of another, then there need not be a causal connection (see section 2). Hence
functional dependence and causal influence should not be identified.

Because of the inherent direction, we can use a directed hypergraph to
represent causal influences. (A hyperedge shows that a conjunction of causes
is needed, whereas separate (normal) edges show that each of several causes can
lead to an effect. However, usually no harm is done if a hyperedge is split into
a set of normal edges.) This structure we call the causal structure $\mathcal{C}$. In
principle directed loops, i.e., circular causal influences, are possible. (Such cycles are
often exploited for control mechanisms, for example Watt’s conical pendulum
governor of the steam engine.) Nevertheless we do not consider circular causal
structures, but assume that the causal influences form a directed acyclic graph.

A very simple and often used example is the following [15]: If it rains (R),
the lawn will get wet (W). But it will also get wet if the sprinkler (S) is turned
on. In addition, if it rains, we will not turn on the sprinkler. Obviously, both R
and S have a causal influence on W and R has a causal influence on S (though
mediated through a human). These influences are represented by the causal
structure shown on the left of figure 1. The probabilistic structure shown on
the right in figure 1 (which, of course, is not unique) is very close to the causal
structure: its corresponding normal graph would be identical to it.

4 d-Separation and Stability

It is obvious that storing both a probabilistic structure and a causal
structure for a given domain is redundant. For instance, a cause and one of its direct
effects should be dependent probabilistically. Thus the question arises how the
structures can be combined. The most promising approach seems to be to look
for a method to read from the causal structure the independence statements that
hold in the corresponding probabilistic structure. The best-known suggestion for
such a method is the so-called d-separation criterion, of which it is claimed that
it allows us to determine whether two variables (or two sets of variables) are
conditionally independent given a set S of variables: they are, if they are d-separated
by S in the causal structure. d-separation is defined as follows [15,4]:

Definition 1. If $R_1$, $R_2$, and $S$ are three disjoint subsets of nodes in a directed
acyclic graph, then $S$ is said to d-separate $R_1$ from $R_2$, iff there is no path from
a node in $R_1$ to a node in $R_2$ along which the following two conditions hold:

1. every node with converging edges either is in $S$ or has a descendant in $S$,

2. every other node is not in $S$.

A path satisfying the conditions above is said to be active; otherwise it is said to
be blocked (by $S$). A path is a sequence of consecutive edges (of any direction).

Note that the d-separation criterion does not say anything about the dependence
or independence of $R_1$ and $R_2$ given S, if $R_1$ and $R_2$ are not d-separated
by S. Usually this is sufficient, if a Bayesian network is to be constructed, since
for applications it is not essential to find and represent all independences.
However, we need more to infer causal structure. Therefore it is assumed that in a
sampled probability distribution P there exist exactly those independences that can
be read from the causal structure C using d-separation. This assumption is called
stability [14] and can be formalized as
($R_1 \perp\!\!\!\perp R_2 \mid S$ holds in P) $\Leftrightarrow$ (S d-separates
$R_1$ and $R_2$ in C), where $R_1$, $R_2$, and S are sets of variables. Note that
the stability assumption states that there is “no correlation without causation”
(also known as Reichenbach’s dictum), since between two variables that are dependent
given any set of other variables, there must be a direct causal influence.

An important property of d-separation and the stability assumption is that
they distinguish a common effect of two causes from the mediating variable in
a causal chain and from the common cause of two effects. In the structures
$A \to B \to C$ and $A \leftarrow B \to C$, A and C are independent given B, but
in the structure $A \to B \leftarrow C$ they are not. This alleged asymmetry,
studied earlier in [16], makes the inferences of the inductive causation
algorithm [14] possible.
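
To make Definition 1 concrete, the following Python sketch checks d-separation by
enumerating all paths between two node sets and testing the two conditions of the
definition on each. It is only an illustration and not taken from the cited work:
the function names (`descendants`, `undirected_paths`, `is_active`, `d_separated`)
and the edge-list representation are our own assumptions, and the brute-force path
enumeration is suitable only for very small graphs.

```python
def descendants(edges, node):
    """All nodes reachable from `node` along directed edges (u, v) meaning u -> v."""
    children = {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def undirected_paths(edges, start, goal):
    """Yield all simple paths from start to goal, ignoring edge direction."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    def extend(path):
        if path[-1] == goal:
            yield path
            return
        for n in neighbours.get(path[-1], ()):
            if n not in path:
                yield from extend(path + [n])

    yield from extend([start])


def is_active(edges, path, s):
    """Check the two conditions of Definition 1 for the inner nodes of a path."""
    edge_set = set(edges)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        converging = (prev, node) in edge_set and (nxt, node) in edge_set
        if converging:
            # Condition 1: node must be in S or have a descendant in S.
            if node not in s and not (descendants(edges, node) & s):
                return False
        elif node in s:
            # Condition 2: every other node must not be in S.
            return False
    return True


def d_separated(edges, r1, r2, s):
    """True iff S d-separates R1 from R2, i.e. no path between them is active."""
    s = set(s)
    return not any(
        is_active(edges, path, s)
        for a in r1
        for b in r2
        for path in undirected_paths(edges, a, b)
    )


chain = [("A", "B"), ("B", "C")]      # A -> B -> C
fork = [("B", "A"), ("B", "C")]       # A <- B -> C
collider = [("A", "B"), ("C", "B")]   # A -> B <- C
```

For the three structures above, `d_separated(chain, {"A"}, {"C"}, {"B"})` and
`d_separated(fork, {"A"}, {"C"}, {"B"})` hold, whereas in the common effect
structure A and C are d-separated by the empty set but not by {B}.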

5 Inductive Causation 

Even with the d-separation criterion and the stability assumption there are usu- 
ally several causal structures that are compatible with the observed (conditional) 
dependences and independences. The main reason is that d-separation and stabil- 
ity cannot distinguish between causal chains and common causes. But in certain 
situations all compatible causal structures have a common substructure. The 
aim of the inductive causation algorithm is to find these invariant substructures.

The only ingredients of the inductive causation algorithm apart from the 
d-separation criterion and the stability assumption are the notions of a latent 
structure and of its projection. A latent structure is simply a causal structure
in which some variables are unobservable (as is often the case in real-world
problems). To handle such hidden variables, the notion of a projection of a latent
structure is introduced. The idea is to restrict the number and influence of latent 
variables while preserving all dependences and independences.

Input: P, a sampled distribution over U, the universe of discourse.

Output: core(P), a marked hybrid acyclic graph.

1. For each pair of variables X and Y, search for a set
   $S_{XY} \subseteq U \setminus \{X, Y\}$ such that
   $X \perp\!\!\!\perp Y \mid S_{XY}$ holds in P, i.e., X and Y are independent
   in P conditioned on $S_{XY}$. If there is no such $S_{XY}$, place an
   undirected edge between the variables.

2. For each pair of non-adjacent variables X and Y with a common neighbour Z (i.e.,
   Z is adjacent to X as well as to Y), check whether $Z \in S_{XY}$. If it is
   not, add arrowheads pointing to Z, i.e., $X \to Z \leftarrow Y$.

3. Form core(P) by recursively adding arrowheads according to the following two rules:

   • If for two adjacent variables X and Y there is a strictly directed path from X
     to Y not including the edge from X to Y, then add an arrowhead pointing to Y.

   • If there are three variables X, Y, and Z with X and Y not adjacent, $Y - Z$,
     and either $X \to Z$ or $X \leftrightarrow Z$, then direct the link $Z \to Y$.

4. For each triplet of variables X, Y, and Z: If X and Y are not adjacent,
   $Z \to Y$, and either $X \to Z$ or $X \leftrightarrow Z$, then mark the edge
   $Z \to Y$.

Fig. 2. The Inductive Causation Algorithm [14]

Definition 2. [14] A latent structure $L_1$ is a projection of another latent
structure $L_2$, if and only if

1. Every unobservable variable in $L_1$ is a parentless common cause of exactly
   two non-adjacent (i.e., not directly connected) observable variables.

2. For every stable distribution $P_2$ which can be generated by $L_2$, there
   exists a stable distribution $P_1$ generated by $L_1$ such that
   $\forall X, Y \in O, S \subseteq O \setminus \{X, Y\}$:
   ($X \perp\!\!\!\perp Y \mid S$ holds in $P_2|_O$) $\Leftrightarrow$
   ($X \perp\!\!\!\perp Y \mid S$ holds in $P_1|_O$), where O is the set of
   observable variables and $P|_O$ denotes the marginal probability distribution
   on these variables.

(A stable distribution satisfies the stability assumption, i.e., exhibits only those
independences identifiable by the d-separation criterion.)

It can be shown that for every latent structure there is at least one projection. 
Note that a projection must exhibit only the same (in)dependence structure
(w.r.t. d-separation), but need not be able to generate the same distribution.^ In 
essence, the notion of a projection is only a technical trick to be able to represent 
dependences that are due to latent variables by bidirected edges (which are an 
intuitive representation of a hidden common cause of exactly two variables). 

^ Otherwise a counterexample could easily be found: Consider seven binary variables
A, B, C, D, E, F, and G, i.e., dom(A) = dom(B) = ... = dom(G) = {0, 1}. Let
A be hidden and E = A · B, F = A · C, and G = A · D. A projection of this
structure contains three latent variables connecting E and F, E and G, and F and
G, respectively. It is easy to prove that such a structure cannot generate a stable
probability distribution that can be generated by the original structure.

One thus arrives at the inductive causation algorithm [14] shown in figure 2.
Step 1 determines the variable pairs between which there must exist a direct
causal influence or a hidden common cause, because an indirect influence should
enable us to find a set S that renders the two variables independent. In step 2
the asymmetry inherent in the d-separation criterion is exploited to direct edges 
towards a common effect. Part 1 of step 3 ensures that the resulting structure 
is acyclic. Part 2 uses the fact that $Y \to Z$ is impossible, since otherwise step 2
would have already directed the edge in this way. Finally, step 4 marks those 
unidirected links that cannot be replaced by a hidden common cause (based on 
similar grounds as part 2 of step 3). The output core has four kinds of edges: 

1. marked unidirected edges representing genuine causal influences (which must
   be direct causal influences in a projection),

2. unmarked unidirected edges representing potential causal influences (which
   may be direct causal influences or brought about by a hidden common cause),

3. bidirected edges representing spurious associations (which are due to a hidden
   common cause in a projection), and

4. undirected edges representing unclassifiable relations.
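
Steps 1 and 2 of the algorithm in figure 2 can be sketched in Python as follows.
This is a hedged illustration only: instead of a sampled distribution it uses a
small hand-made joint distribution (all numbers are hypothetical) generated by a
common effect structure A -> B <- C, tests conditional independence exactly on
that table, and omits steps 3 and 4 as well as any statistical testing. All names
(`joint`, `marginal`, `independent`, `ic_steps_1_and_2`) are our own assumptions.

```python
from itertools import combinations, product

VARS = ("A", "B", "C")


def joint():
    """Hypothetical joint distribution generated by the collider A -> B <- C."""
    p_a = {0: 0.5, 1: 0.5}
    p_c = {0: 0.7, 1: 0.3}
    p_b1 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}  # P(B=1 | a, c)
    dist = {}
    for a, b, c in product((0, 1), repeat=3):
        p_b = p_b1[(a, c)] if b == 1 else 1.0 - p_b1[(a, c)]
        dist[(a, b, c)] = p_a[a] * p_c[c] * p_b
    return dist


def marginal(dist, names):
    """Marginal probability table over the variables listed in `names`."""
    idx = [VARS.index(n) for n in names]
    table = {}
    for values, p in dist.items():
        key = tuple(values[i] for i in idx)
        table[key] = table.get(key, 0.0) + p
    return table


def independent(dist, x, y, s, tol=1e-9):
    """Test X _||_ Y | S via P(x,y,s) * P(s) == P(x,s) * P(y,s) for all values."""
    s = tuple(s)
    p_xys = marginal(dist, (x, y) + s)
    p_xs = marginal(dist, (x,) + s)
    p_ys = marginal(dist, (y,) + s)
    p_s = marginal(dist, s)
    return all(
        abs(p * p_s[k[2:]] - p_xs[(k[0],) + k[2:]] * p_ys[(k[1],) + k[2:]]) <= tol
        for k, p in p_xys.items()
    )


def ic_steps_1_and_2(dist):
    """Return the undirected skeleton and the arrowheads added in step 2."""
    sepset, skeleton = {}, set()
    # Step 1: connect X and Y unless some set S_XY renders them independent.
    for x, y in combinations(VARS, 2):
        others = [v for v in VARS if v not in (x, y)]
        for size in range(len(others) + 1):
            found = next((s for s in combinations(others, size)
                          if independent(dist, x, y, s)), None)
            if found is not None:
                sepset[frozenset((x, y))] = set(found)
                break
        else:
            skeleton.add(frozenset((x, y)))
    # Step 2: orient X -> Z <- Y if Z is a common neighbour outside S_XY.
    arrowheads = set()  # a pair (tail, head) records an arrowhead at `head`
    for x, y in combinations(VARS, 2):
        pair = frozenset((x, y))
        if pair in skeleton:
            continue
        for z in VARS:
            if (z not in pair
                    and frozenset((x, z)) in skeleton
                    and frozenset((y, z)) in skeleton
                    and z not in sepset[pair]):
                arrowheads.update({(x, z), (y, z)})
    return skeleton, arrowheads
```

For this distribution the sketch finds the undirected edges A - B and B - C and,
since B is not in the separating set of A and C, adds arrowheads at B from both
sides.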

6 Critique of the Underlying Assumptions 

In this section we discuss the assumptions underlying d-separation and stability
by considering some special cases with only few variables. The simplest cases are
causal chains, like the one shown in figure 3.a. If a variable has a direct causal
influence on another, they should be dependent at least unconditionally, i.e.,
$A \not\perp\!\!\!\perp B \mid \emptyset$ and
$B \not\perp\!\!\!\perp C \mid \emptyset$. It is also obvious that
$A \perp\!\!\!\perp C \mid \{B\}$. A direct cause, if fixed, should shield the
effect from any change in an indirect cause, since a change in the indirect cause
can influence the effect only by changing the direct cause. But to decide whether
B and C are dependent given A or not, we need to know the causal influences in more
detail. For instance, if B = f(A) and C = g(B), then
$B \perp\!\!\!\perp C \mid A$. But if the value of A does not completely
determine the value of B (just as the rain did not completely determine the state
of the sprinkler in the lawn example), then B and C will usually be dependent.
Although the former is not uncommon, the stability assumption excludes it.

The next cases are diverging or converging causal influences, like those shown
in figures 3.b and 3.c. The main problems with these structures are whether
$B \perp\!\!\!\perp C \mid \{A\}$ (in 3.b) and $A \perp\!\!\!\perp B \mid \{C\}$
(in 3.c) hold or not. The assumptions by which d-separation and the stability
assumption handle this difficulty are:

Common Cause Assumption (Causal Markov Assumption).

Given all of their (direct or indirect) common causes, two effects are independent,
i.e., in figure 3.b the variables B and C are independent given A. If B and C are
still dependent given A, it is postulated that either B has a causal influence on
C or vice versa or there is another (hidden) common cause of B and C (apart
from A). That is, the causal structure is considered to be incomplete.







Fig. 4. Interaction of common cause 
and common effect assumption 




Common Effect Assumption. 

Given one of their (direct or indirect) common effects, two causes are dependent, 
i.e., in figure 3.c the variables A and B are dependent given C. For applications, 
this assumption is less important than the previous one, since nothing is lost, if 
it is assumed that A and B are dependent given C though they are not. Only 
the storage savings resulting from a possible decomposition cannot be exploited. 



Note that the common cause assumption necessarily holds, if causation is inter- 
preted as functional dependence. Then it only says that fixing all the arguments 
that (directly or indirectly) enter both functions associated with the two effects 
renders the effects independent. But this is obvious, since any variation still pos- 
sible has to be due to independent arguments that enter only one function. This 
is the main reason why we rejected this interpretation of causation. It is not at 
all obvious that causation should satisfy the common cause assumption. 

A situation with diverging causal influences also poses another problem: Are
B and C independent unconditionally? In most situations they are not, but if, for
example, dom(A) = {0, 1, 2, 3}, dom(B) = dom(C) = {0, 1} and B = A mod 2,
C = A div 2, then they will be (for instance, if A is uniformly distributed). The
stability assumption rules out this possibility.
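
The mod/div example can be checked by exact enumeration. The following Python
sketch is only an illustration under stated assumptions: the text does not fix the
distribution of A, so we assume a uniform distribution for the independent case
and contrast it with a hypothetical skewed distribution; the names `joint_bc` and
`is_independent` are our own.

```python
from fractions import Fraction
from itertools import product


def joint_bc(p_a):
    """Joint distribution of B = A mod 2 and C = A div 2 for a given P(A)."""
    table = {}
    for a, p in p_a.items():
        key = (a % 2, a // 2)
        table[key] = table.get(key, Fraction(0)) + p
    return table


def is_independent(table):
    """Check P(b, c) == P(b) * P(c) for all four value combinations."""
    p_b = {b: sum(p for (bb, _), p in table.items() if bb == b) for b in (0, 1)}
    p_c = {c: sum(p for (_, cc), p in table.items() if cc == c) for c in (0, 1)}
    return all(
        table.get((b, c), Fraction(0)) == p_b[b] * p_c[c]
        for b, c in product((0, 1), repeat=2)
    )


# Assumed distributions of the common cause A (not given in the text).
uniform = {a: Fraction(1, 4) for a in range(4)}
skewed = {0: Fraction(1, 2), 1: Fraction(1, 6), 2: Fraction(1, 6), 3: Fraction(1, 6)}
```

With the uniform distribution, `is_independent(joint_bc(uniform))` holds although A
is a common cause of B and C; with the skewed distribution the two effects are
marginally dependent, which is the case the stability assumption expects.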

The two assumptions also interact and this can lead to a priority problem. 
For example in figure 4: Given A as well as D, are B and C independent? The 
common cause assumption affirms this, the common effect assumption denies it. 
Since the stability assumption requires B and C to be dependent, it contains the 
assumption that in case of a tie the common effect assumption has the upper 
hand. Note that from strict functional dependence $B \perp\!\!\!\perp C \mid \{A, D\}$ follows.

In the following we examine some of the assumptions identified above in more
detail, especially the common cause and the common effect assumption.

Common Cause Assumption (Causal Markov Assumption)

Consider an arrangement of tubes like the one shown in figure 5.a. If a ball is
dropped into this arrangement, it will reappear at one of the two outlets. If we 
neglect the time it takes the ball to travel through the tubes, we can define three 
binary variables T, L, and R indicating whether there is a ball at the top T, at 
the left outlet L or at the right outlet R. Obviously, whether there is a ball at T 
or not has a causal influence on L and on R. But L and R are dependent given 
T, because the ball can reappear only at one outlet. 

At first sight the common cause assumption seems to fail in this situation.
However, we can always assume that there is a hidden common cause, for
instance, an imperfection of the ball or the tubes. If we knew the state of this
cause, the outlet at which the ball will reappear could be determined and hence
the common cause assumption would hold. Obviously, if there is a dependence
between two effects, we can always say that there must be another hidden
common cause. We just did not find it, because we did not look hard enough. Since
this is a statement of existence, it cannot be disproven.

Fig. 5. a) Y-shaped tube arrangement into which a ball is dropped. Since it can
reappear only at L or at R, but not at both, the corresponding variables are
dependent. b) Billiard with round obstacles exhibits sensitive dependence on the
initial conditions.

The idea that, in principle, we could discover the causes that determine the 
way the ball goes is deeply rooted in the mechanistic paradigm of physics, which 
is perhaps best symbolized by Laplace’s demon. ^ But quantum theory suggests 
that such a view is wrong [1, 13]: It may very well be that even if we look hard 
enough, we will not find a hidden common cause to explain the dependence. 

To elaborate a little: Among the basic statements of quantum mechanics
are Heisenberg’s uncertainty relations. One of these states that
$\Delta x \cdot \Delta p_x \geq \hbar/2$. That is, we cannot measure both the
location $x$ and the momentum $p_x$ of a particle with arbitrary precision in such
a way that we can predict its exact trajectory. There is a finite upper bound due
to the unavoidable interaction with the observed particle. However, in our example
we may need to predict the exact trajectory of the ball in order to determine the
outlet with certainty.

The objection may be raised that $\hbar/2$ is too small to have any observable
influence. To refute this, we could add to our example an “uncertainty amplifier”
based on the ideas studied in chaos theory, i.e., a system that exhibits a sensitive 
dependence on the initial conditions. A simple example is billiard with round 
obstacles [17], as shown in figure 5.b. The two trajectories of the billiard ball b, 
which in the beginning differ only by about degree, differ by about 100 de- 
grees after only four collisions. (This is a precisely computed example, not a 
sketch.) Therefore, if we add a wider tube containing spheres or semi-spheres in 
front of the inlet T, it is plausible that even a tiny change of the position or the 
momentum of the ball at the new inlet may change the outlet at which the ball 
will reappear. Therefore quantum mechanical uncertainty cannot be neglected. 

Another objection is that there could be “hidden parameters”, which, if
discovered, would remove the statistical nature of quantum mechanics. However,
as [13] showed*, this is tantamount to claiming that quantum mechanics is false,
a claim for which we do not have any convincing evidence.

® Laplace wrote [7]: “We may regard the present state of the universe as the effect of 
its past and the cause of its future. An intellect which at any given moment knew all 
the forces that animate nature and the mutual positions of the beings that compose 
it, if this intellect were vast enough to submit the data to analysis, could condense 
into a single formula the movement of the greatest bodies of the universe and that of 
the lightest atom: for such an intellect nothing would be uncertain; and the future 
just like the past would be present before its eyes.” 

* von Neumann wrote: “[...] the established results of quantum mechanics can never be
re-derived with their [the hidden parameters’] help. In fact, we have even ascertained
that it is impossible that the same physical quantities exist with the same function
connections [...], if other variables (i.e., “hidden parameters”) should exist in addition
to the wave functions. Nor would it help if there existed other, as yet undiscovered
physical quantities, [...], because the relations assumed by quantum mechanics [...]
would have to fail already for the known quantities [...] It is therefore not, as often
assumed, a question of a re-interpretation of quantum mechanics, the present
system of quantum mechanics would have to be objectively false, in order that another
description of the elementary processes than the statistical one be possible.”

Table 1. Death sentencing and race in Florida 1973-1979. The hypothesis that the
two variables are independent can be rejected only with an error probability
greater than 7.8% (according to a $\chi^2$ test).

Table 2. Death sentencing and race in Florida 1973-1979, full table. For white
victims the hypothesis that the two other variables are independent can be
rejected with an error probability less than 0.01% (according to a $\chi^2$ test).

Fig. 6. Core inferred by the inductive causation algorithm for the above data.

According to Salmon [18], it seems to be hard to come up with an example in
which the common effect assumption does not hold. Part of the problem seems
to be that most macroscopic phenomena are described by continuous real-valued
functions, but there is no continuous n-ary function, $n \geq 2$, which is injective
(and would be a simple, though not the only possible counterexample).

However, there are real world examples that come close, for instance,
statistical data concerning death sentencing and race in Florida 1973-1979 (according
to [8] as cited in [22]). From table 1 it is plausible to assume that murderer and
sentence are independent. Splitting the data w.r.t. victim shows that they are
strongly dependent given this variable (see table 2). Hence the inductive
causation algorithm yields the causal structure shown in figure 6. But this is not
acceptable: A direct causal influence of sentence on victim is obviously
impossible (since the sentence follows the murder in time), while a common cause is
hardly imaginable. The most natural explanation of the data, namely that victim
has a causal influence on sentence, is explicitly ruled out by the algorithm.

This example shows that an argument mentioned in [14] in favour of the
stability assumption is not convincing. It refers to [20], where it is shown that,
if the parameters of a distribution are chosen at random from any reasonable
distribution, then any unstable distribution has measure zero. But the problem
is that this is not the correct set of distributions to look at. When trying to infer
causal influence, we have to take into account all distributions that could be
mistaken for an unstable distribution. Indeed, the true probability distribution
in our example may very well be stable, i.e., murderer and sentence may actually
be marginally dependent. But the distribution in the sample is so close to an
independent distribution that it may very well be confused with one.

In addition, the special parameter assignments leading to unstable distributions 
may have high probability. For example, it would be reasonable to assume that 
two variables are governed by the same probability distribution, if they were 
the results of structurally equivalent processes. Yet such an assumption can lead 
to an unstable distribution, especially in a situation, in which common cause 
and common effect assumption interact. For instance, for a Fredkin gate [2] 
(a universal gate for computations in conservative logic, see figure 7), the two 
outputs C and D are independent, if the two inputs A and B assume value 1 with 
the same probability. In this case, as one can easily verify, the causal direction 
assigned to the connection A — C depends on whether the variables A, B, and 
C or the variables A, C, and D are observed. 
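
The independence of the two outputs can be verified by exact enumeration. The
following Python sketch rests on assumptions that are not taken from figure 7: we
model the Fredkin gate as a controlled swap with an additional control input,
here called S (a hypothetical name), that swaps A and B when it is 1, and we
assume that A, B, and S are independent. The function names are our own as well.

```python
from fractions import Fraction
from itertools import product


def bernoulli(p):
    """Distribution of a binary variable that is 1 with probability p."""
    return {1: p, 0: 1 - p}


def output_joint(p_a, p_b, p_s):
    """Joint distribution of the outputs (C, D) of a controlled swap on (A, B)."""
    table = {}
    for a, b, s in product((0, 1), repeat=3):
        c, d = (b, a) if s else (a, b)  # assumed gate behaviour: swap iff S = 1
        p = bernoulli(p_a)[a] * bernoulli(p_b)[b] * bernoulli(p_s)[s]
        table[(c, d)] = table.get((c, d), Fraction(0)) + p
    return table


def outputs_independent(table):
    """Check P(c, d) == P(c) * P(d) for all four value combinations."""
    p_c = {c: table.get((c, 0), 0) + table.get((c, 1), 0) for c in (0, 1)}
    p_d = {d: table.get((0, d), 0) + table.get((1, d), 0) for d in (0, 1)}
    return all(
        table.get((c, d), 0) == p_c[c] * p_d[d]
        for c, d in product((0, 1), repeat=2)
    )
```

Under these assumptions, equal input probabilities such as
`output_joint(Fraction(1, 3), Fraction(1, 3), Fraction(1, 4))` give independent
outputs, while unequal ones such as
`output_joint(Fraction(1, 3), Fraction(2, 3), Fraction(1, 4))` do not.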



7 Conclusions 

The discussion of the assumptions underlying the inductive causation algorithm 
showed that at least some of them can be reasonably doubted. In addition, the 
inductive causation algorithm cannot deal adequately with accidental correla- 
tions. But we saw in section 2 that we sometimes reject a causal explanation in 
spite of the statistical data supporting such a claim. In our opinion it is very 
important for an adequate theory of causation to explain such a rejection.® In 
summary, when planning to apply this algorithm, one should carefully check 
whether the assumptions can be accepted and whether the underlying interpre- 
tation of causality is adequate for the problem at hand. 

A related question is: Given a causal relation between two variables, we are 
usually much more confident in an inference from the state of one of them to the 
state of the other than we would be, if our reasoning was based only on a number 
of similar cases we observed in the past. But the inductive causation algorithm 
infers causation from a set of past observations, namely a sampled probability 
distribution. If the result is not substantiated by other means, can we be any 
more confident in our reasoning than we would be, if we based it directly on 
the observed correlations? It seems to be obvious that we cannot. Hence the
question arises whether the inductive causation algorithm is more than just a
heuristic method to point out possible causal connections, which then have to be
further investigated. Of course, this does not discredit the inductive causation
algorithm, since good heuristics are a valuable thing to have.

® An approach to causation that does not suffer from this deficiency was suggested by 
Lorenz and later developed e.g. in [21]. It models causal connections as a transfer of 
energy. [12] suggests a closely related model.

Abstract. This paper reviews and relates two default reasoning mechanisms,
lexicographic (lex) and maximum entropy (me) entailment. Me-entailment
requires that defaults be assigned specific strengths and it is
shown that lex-entailment can be equated to me-entailment for a class
of specific strength assignments. By clarifying the assumptions which
underlie lex-entailment, it is argued that me-entailment is a superior
method of handling default inference for reasons of both expressiveness
and objective justification.



1 Introduction 

The most widely accepted extension to a set of defaults is its p-closure [6] which 
is the fixed point result of applying the rules of System P. The p-closure contains 
all defaults which can be probabilistically entailed in the sense of Adams [1]. But
the p-closure is too conservative to sanction common patterns of nonmonotonic 
reasoning such as the ability to ignore irrelevant information or to allow inheri- 
tance to exceptional subclasses. Lehmann and Magidor’s rational closure [8], or 
equivalently Pearl’s System Z [10], succeeded in solving the first problem but 
the inheritance problem requires more sophisticated machinery. 

This paper examines two systems which have been proposed to deal with the
exceptional inheritance problem. These are lexicographic (lex) entailment [2, 7]
(section 2.3), which is justified by presumptions of typicality, independence,
priority and specificity, and maximum entropy (me) entailment [4, 3] (section 3),
which uses the principle of maximum entropy as a means of selecting the least
biased probability distribution associated with an incomplete set of probabilistic
constraints. Both systems are described and shown to exhibit the required behaviour.

It is shown (section 4) that it is possible to recreate the lexicographic closure 
of a set of defaults under maximum entropy by assigning appropriate strengths to 
the defaults. An algorithmic definition is given which translates the lex-ordering 
into an me-ranking and hence finds a set of canonical me-strengths for the
defaults. This implies that lex-entailment can be thought of as a subset of
me-entailment corresponding to a particular choice of strength assignments.

The dynamic behaviour of the system of lex-entailment is examined (section 
5). It is shown that the semantics of a default, when interpreted as its canonical
me-strength, is highly dependent on its surrounding defaults with respect to the
lex-ordering. Under maximum entropy, however, a default’s semantics can be 
fixed and independent of other defaults. This finding is used to argue that the 
lex-ordering requires the user to accept some rather strong assumptions. 

By connecting the two systems, the intuitions underlying lex-entailment are 
clarified, and, it is argued, the more general approach of me-entailment is both 
more expressive, since it allows variable strength defaults to be represented ex- 
plicitly, and more justifiable, by virtue of its grounding in a well-understood 
principle of reasoning rationally from incomplete information. 

2 Lexicographic entailment 

2.1 Definitions and notation 

First some preliminary definitions and notation. A finite propositional language
$\mathcal{L}$ is made up of propositions a, b, c, ... and the usual connectives
$\wedge$, $\vee$, $\neg$.

A default is a pair of propositions or formulas joined by a default connective
$\Rightarrow$, e.g., $a \Rightarrow b$. The language has a finite set of models,
M. A model m verifies a default $a \Rightarrow b$ iff $m \models a \wedge b$,
where $\models$ is classical entailment, and falsifies it iff
$m \models a \wedge \neg b$. A default r tolerates a set of defaults $\Delta$ iff
it has a verifying model which does not falsify any defaults in $\Delta$; such a
model will be called a confirming model of r with respect to $\Delta$.

It has been shown in [8] that any consequence relation that satisfies all the
rules of System P plus that of rational monotonicity is equivalent to a total
ordering of the models in M and, conversely, any total ordering of the models
in M is equivalent to a so-called rational consequence relation. The rank of a
formula in such an ordering is the rank of its minimal satisfying model(s). A
ranking, $\kappa$, is called admissible with respect to a set of defaults,
$\Delta$, iff for all $a \Rightarrow b \in \Delta$,
$\kappa(a \wedge b) \prec \kappa(a \wedge \neg b)$. Similarly, a default
$c \Rightarrow d$ belongs to the rational consequence relation determined by
$\kappa$ iff $\kappa(c \wedge d) \prec \kappa(c \wedge \neg d)$. Three
mechanisms for generating such a total order are provided by System Z (section
2.2), the lex-ordering (section 2.3) and the me-ranking (section 3).

2.2 System Z 

System Z [10], or equivalently rational closure [8], can be defined as follows.
Given a p-consistent set of defaults^, $\Delta$, it is possible to identify a
subset $\Delta_0$ made up of all the defaults which tolerate all other defaults in
$\Delta$. Then, given $\Delta - \Delta_0$ it is possible to identify another
subset, $\Delta_1$, made up of all the defaults which tolerate all members of
$\Delta - \Delta_0$, and the process continues until all the remaining defaults
tolerate each other. This process gives the unique z-partition
$\Delta = \Delta_0 \cup \Delta_1 \cup \dots \cup \Delta_n$. Each default is
assigned a z-rank which is the index of the $\Delta_i$ to which it belongs, and
each model is assigned a z-rank of 1 plus the highest z-rank of all the defaults it
falsifies, or 0 if it falsifies no defaults. This z-ranking is admissible with
respect to $\Delta$ and z-entailment is determined from this ranking. Since the
higher the z-rank of a model the more abnormal (in the sense of being less
probable) it is, a default is z-entailed iff the z-rank of its minimal verifying
model(s) is strictly less than the z-rank of its minimal falsifying model(s)
(meaning that it is more normal for the default to be verified than falsified).

^ A set of defaults is p-consistent iff every non-empty subset is confirmable [1] or,
equivalently, iff there exists an admissible ranking function with respect to that set.

Fig. 1. The z-rankings for the penguin example.

Example 1 (Penguins).

$\Delta = \{b \Rightarrow f, b \Rightarrow w, p \Rightarrow b, p \Rightarrow \neg f\}$

(the intended interpretation of this database is that birds fly, birds have wings,
penguins are birds but penguins do not fly). The z-partition of this database is:

$\Delta_0 = \{b \Rightarrow f, b \Rightarrow w\}$ and $\Delta_1 = \{p \Rightarrow b, p \Rightarrow \neg f\}$

Here $\mathcal{L}$ has four atoms so M contains only 16 models. Figure 1 enumerates
these models along with their z-ranks. To establish whether the default “penguins
have wings” is z-entailed, it is necessary to consider the z-ranks of the minimal
verifying and falsifying models of $p \Rightarrow w$ (see figure 1):

$z(p \wedge w) = 1 = z(p \wedge \neg w)$

and so $p \Rightarrow w$ is not z-entailed. •
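
The z-partition and the z-ranks of Example 1 can be reproduced by brute force over
the 16 models. The following Python sketch is an illustration only; the encoding of
defaults as pairs of predicates and all function names are our own assumptions, and
exhaustive enumeration is feasible only for such small languages.

```python
from itertools import product

ATOMS = ("b", "f", "w", "p")
# Each default is a pair (antecedent, consequent) of predicates on a model.
DEFAULTS = {
    "b=>f": (lambda m: m["b"], lambda m: m["f"]),
    "b=>w": (lambda m: m["b"], lambda m: m["w"]),
    "p=>b": (lambda m: m["p"], lambda m: m["b"]),
    "p=>~f": (lambda m: m["p"], lambda m: not m["f"]),
}
MODELS = [dict(zip(ATOMS, bits)) for bits in product((False, True), repeat=4)]


def verifies(m, name):
    ante, cons = DEFAULTS[name]
    return ante(m) and cons(m)


def falsifies(m, name):
    ante, cons = DEFAULTS[name]
    return ante(m) and not cons(m)


def z_partition(names):
    """Repeatedly split off the defaults that tolerate all remaining defaults."""
    remaining, partition = set(names), []
    while remaining:
        tolerant = {
            r for r in remaining
            if any(verifies(m, r) and not any(falsifies(m, q) for q in remaining)
                   for m in MODELS)
        }
        if not tolerant:
            raise ValueError("set of defaults is not p-consistent")
        partition.append(tolerant)
        remaining -= tolerant
    return partition


def z_rank_model(m, partition):
    """1 plus the highest z-rank of a falsified default, or 0 if none is falsified."""
    ranks = [i for i, part in enumerate(partition) for r in part if falsifies(m, r)]
    return 1 + max(ranks) if ranks else 0


def z_rank_formula(pred, partition):
    """z-rank of the minimal model(s) satisfying the predicate `pred`."""
    return min(z_rank_model(m, partition) for m in MODELS if pred(m))
```

Calling `z_partition(DEFAULTS)` yields the two sets of the z-partition given above,
and `z_rank_formula` returns 1 both for p ∧ w and for p ∧ ¬w, so the comparison
required for z-entailment of "penguins have wings" fails.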

This example illustrates one of the problems with z-entailment — it does not allow 
inheritance to exceptional subclasses. 

2.3 The lexicographic ordering 

The lexicographic ordering was proposed by Lehmann [7] who argued that the 
behaviour of the ideal rational consequence relation should satisfy four presump- 
tions of typicality, independence, priority and specificity. He also drew attention 
to the differences between the presumptive reading of a default, as first de- 
veloped by Reiter [11], and the prototypical reading for which, he claims, the 
rational closure [8, 10] is the “correct formalization”. A more flexible variant of 
Lehmann’s lexicographic closure is given by Benferhat et al. [2] who allow the
user to determine the priorities of defaults, rather than being restricted to the
ranks determined by the z-partition.

Fig. 2. The lex-tuples for the penguin example.

Lexicographic entailment is defined as follows. The lex-ordering over the
models of $\mathcal{L}$ is based on the z-partition but takes into account all
defaults violated by a model, not just that with the greatest z-rank. The result is
a form of entailment which is a direct extension of System Z in the sense that all
z-entailed defaults are also lex-entailed.

Given a set of defaults, $\Delta$, and its z-partition,
$\Delta_0 \cup \Delta_1 \cup \dots \cup \Delta_n$, each model is assigned an
(n + 1)-tuple with the number of defaults violated in partition-set $\Delta_i$
appearing in position i of the tuple. The lex-ordering of tuples (and hence
models) is determined by considering the last elements of the tuples first. If
one tuple has fewer default violations in the highest tuple element, it is lower
(or preferred) in the lex-ordering; otherwise the next highest tuple element is
considered. For example, $(1,1,0) \prec (0,0,2)$ and $(2,0,1) \prec (0,1,1)$.
From the lex-ordering, entailment is determined as usual by comparing the
lex-tuples of the minimal verifying and falsifying models of a default.

Example 2 (Penguins (continued)). Figure 2 gives the lex-tuples of default
violations for each model. Comparing the minimal verifying and falsifying models
of $p \Rightarrow w$ gives:

$lex(p \wedge w) = (1,0) \prec (2,0) = lex(p \wedge \neg w)$

and so $p \Rightarrow w$ is lex-entailed. •
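
The lex-tuples and their ordering can be sketched in Python as follows. This is an
illustration only: the z-partition is hard-coded from Example 1, defaults are
encoded as hypothetical triples (antecedent atom, consequent atom, required truth
value), and the comparison "last tuple element first" is implemented by comparing
reversed tuples. All names are our own.

```python
from itertools import product

ATOMS = ("b", "f", "w", "p")
# z-partition of the penguin database, taken from Example 1.
PARTITION = [
    [("b", "f", True), ("b", "w", True)],    # Delta_0: b => f, b => w
    [("p", "b", True), ("p", "f", False)],   # Delta_1: p => b, p => not f
]
MODELS = [dict(zip(ATOMS, bits)) for bits in product((False, True), repeat=4)]


def lex_tuple(m):
    """Number of defaults falsified by model m in each partition set."""
    return tuple(
        sum(1 for ante, cons, val in part if m[ante] and m[cons] != val)
        for part in PARTITION
    )


def lex_key(t):
    """Sort key: the last (highest-rank) tuple element is compared first."""
    return tuple(reversed(t))


def lex_min(pred):
    """Lex-tuple of the minimal model(s) satisfying the predicate `pred`."""
    return min((lex_tuple(m) for m in MODELS if pred(m)), key=lex_key)
```

With this key, `lex_key((1, 1, 0)) < lex_key((0, 0, 2))` and
`lex_key((2, 0, 1)) < lex_key((0, 1, 1))`, matching the two comparisons given in
the definition, and `lex_min` returns (1, 0) for p ∧ w and (2, 0) for p ∧ ¬w.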

As the example demonstrates, lex-entailment does provide for inheritance to 
exceptional subclasses. 

3 Maximum entropy entailment 

Ranking functions can be viewed as an abstraction of a probabilistic semantics
for defaults [10]. A default can be thought of as a constraint on a probability
distribution (PD) and so a set of defaults constrains the possible PDs. Usually
these will not be sufficient to completely specify a single PD. Goldszmidt et al. [4]
developed the maximum entropy approach to default reasoning by applying the
principle of maximum entropy, which is a well understood means of selecting that
PD which satisfies a set of constraints and contains the least extra information
[5]. If one has to select a PD from all possible ones, choosing one other than that
which has maximum entropy means making additional assumptions or implicitly
assuming extra constraints.

me-algorithm

Input: a set of variable strength defaults, $\{r_i : a_i \Rightarrow b_i\}$.

Output: an me-valid ranking, $\kappa$, if one exists.

[1] Initialise all $\kappa(r_i)$ = INF.

[2] While any $\kappa(r_i)$ = INF do:

    (a) For all $r_i$ with $\kappa(r_i)$ = INF, compute MINV($r_i$) + $s_i$.

    (b) For all such $r_i$ with minimal MINV($r_i$) + $s_i$, compute MINF($r_i$).

    (c) Select $r_j$ with minimal MINF($r_j$).

    (d) If MINF($r_j$) = INF let $\kappa(r_j)$ := 0
        else let $\kappa(r_j)$ := $s_j$ + MINV($r_j$) - MINF($r_j$).

[3] Assign ranks to models using equation (2).

[4] Check constraints (1) to verify this is an me-valid ranking.

Fig. 3. The me-algorithm

It would be useful therefore to be able to compare systems of default reason- 
ing with the answers obtained from the me-approach in order to understand what 
implicit assumptions underlie those systems. In order to do this, the me-approach 
originally proposed by Goldszmidt et al. [4] has been extended by Bourne and 
Parsons [3] to admit arbitrary sets of defaults with variable strengths. The me-ranking
of a set of defaults $\{r_i\}$ with strengths $\{s_i\}$ can be found by applying
the me-algorithm given in figure 3. The me-algorithm looks for a solution to the 
following set of non-linear simultaneous equations: 



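\[ \min_{m \models a_i \wedge \neg b_i} [\mathrm{me}(m)] = s_i + \min_{m \models a_i \wedge b_i} [\mathrm{me}(m)] \tag{1} \]

\[ \mathrm{me}(m) = \sum_{r_i \,:\, m \models a_i \wedge \neg b_i} \mathrm{me}(r_i) \tag{2} \]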



The solution is a set of me-ranks corresponding to each default, $\{\mathrm{me}(r_i)\}$. From
these, using (2), the me-ranks of each model, $\{\mathrm{me}(m)\}$, can be determined.

As discussed in detail in [3], the ranking found by the me-algorithm may
not be the unique solution to the equations; indeed, for certain strength
assignments no solution exists at all. However, the algorithm does find the
unique solution when there is one.

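Example 3 (Penguins (continued)). Let each rule $r_i$ have an associated strength
$s_i$. The constraint equations (1) give rise to:

\[ \mathrm{me}(r_1) = s_1 \qquad \mathrm{me}(r_3) = s_3 + \min(\mathrm{me}(r_1), \mathrm{me}(r_2)) \]

\[ \mathrm{me}(r_2) = s_2 + \min(\mathrm{me}(r_1), \mathrm{me}(r_3)) \qquad \mathrm{me}(r_4) = s_4 \]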
Fig. 4. The me-ranks for the penguin example.



which have the unique solution $\mathrm{me}(r_1) = s_1$, $\mathrm{me}(r_2) = s_1 + s_2$, $\mathrm{me}(r_3) = s_1 + s_3$,
and $\mathrm{me}(r_4) = s_4$. The me-rankings are given in figure 4.

Comparing the minimal verifying and falsifying models of $p \Rightarrow w$ gives:

\[ \mathrm{me}(p \wedge w) = s_1 < s_1 + \min(s_2, s_4) = \mathrm{me}(p \wedge \neg w) \]

and so $p \Rightarrow w$ is me-entailed.

Clearly, this default is me-entailed under any strength assignment because
the solution for the $\{\mathrm{me}(r_i)\}$ holds for any $\{s_i\}$. This will not be true in general as
different strength assignments may map to qualitatively different me-rankings. •

As the example demonstrates, me-entailment also provides for inheritance to 
exceptional subclasses. 
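
The closed-form solution quoted in Example 3 can be checked numerically against the four constraint equations. The sketch below does only that; it assumes positive integer strengths, and the function names are illustrative rather than part of the me-algorithm.

```python
def example3_me_ranks(s1, s2, s3, s4):
    """Closed-form solution stated in Example 3."""
    return {"r1": s1, "r2": s1 + s2, "r3": s1 + s3, "r4": s4}


def satisfies_example3_constraints(me, s1, s2, s3, s4):
    """Check the four constraint equations derived from (1) in Example 3."""
    return (
        me["r1"] == s1
        and me["r2"] == s2 + min(me["r1"], me["r3"])
        and me["r3"] == s3 + min(me["r1"], me["r2"])
        and me["r4"] == s4
    )


def p_implies_w_entailed(s1, s2, s4):
    """me(p and w) = s1 must lie strictly below me(p and not w) = s1 + min(s2, s4)."""
    return s1 < s1 + min(s2, s4)


# A few arbitrary positive strength assignments (assumed for illustration).
for strengths in [(1, 1, 1, 1), (2, 5, 1, 3), (4, 1, 7, 2)]:
    s1, s2, s3, s4 = strengths
    ranks = example3_me_ranks(*strengths)
    print(strengths, ranks,
          satisfies_example3_constraints(ranks, *strengths),
          p_implies_w_entailed(s1, s2, s4))
```

For every positive assignment the solution satisfies the constraints and the inequality holds, which is the observation made at the end of the example.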

4 Translating lexicographic to maximum entropy 

By changing the strengths assigned to defaults, it is possible to produce many 
different me-rankings, all of which represent rational consequence relations [3]. 
The me-rankings differ because the different strengths change the default infor- 
mation being encoded. However, the me-ranking corresponding to any given set 
of strengths represents the least biased estimate of the underlying probability 
distribution [5]. In contrast, the lex-ordering is unique and fixed for a given set of 
defaults [7] . It follows that the lex-ordering implies some additional assumptions 
are being made about what default information represents and it is reasonable to 
ask what these might be. By showing that the lex-ordering can be equated to a 
class of me-rankings, this section aims to make explicit the underlying semantics 
of lexicographic entailment. 

The similarity between these two forms of entailment lies in the fact that 
in both methods the ordering makes use of all defaults falsified by each model. 
In the lex-ordering the tuple represents the position and number of defaults 
falsified, whilst for the me-ranking, the me-rank of each model is the sum of the 
me-ranks of each default it falsifies. Thus by assigning appropriate me-ranks to 
the defaults it is possible to create an me-ranking which is equivalent to the lex- 
ordering, in the sense that the ordering of models is the same. It is then possible 
Translation algorithm

Input: A partitioning of $\Delta$, $\Delta_0 \cup \Delta_1 \ldots \cup \Delta_n$.

Output: The canonical me-ranking, $\mathrm{me}_\Delta$, plus associated strength
assignment, $\{s_i\}$.

[1] Let $\mathrm{me}(r_i) = 1$ for all $r_i \in \Delta_0$.

[2] For $k = 1$ to $n$:

(a) Let $\mathrm{me}(\Delta_k) = (|\Delta_{k-1}| + 1) \cdot \mathrm{me}(\Delta_{k-1})$.

(b) Let $\mathrm{me}(r_i) = \mathrm{me}(\Delta_k)$ for all $r_i \in \Delta_k$.

[3] For each $r_i$:

(a) Find the ranks of its minimal verifying and falsifying models,
$\mathrm{me}_\Delta(v_{r_i})$ and $\mathrm{me}_\Delta(f_{r_i})$, using equation (2).

(b) Set $s_i = \mathrm{me}_\Delta(f_{r_i}) - \mathrm{me}_\Delta(v_{r_i})$.

Fig. 5. The translation algorithm



to compute what strength assignment over defaults gives rise to this me-ranking. 
From the characteristics of this strength assignment, it is possible to interpret 
what exactly the lex-ordering means in terms of what the implications are for 
the relative strengths of defaults. 

In order to create an me-ranking equivalent to the lex-ordering, all defaults in 
a given partition-set should have the same me-rank. This ensures that whenever 
two models falsify different defaults which belong to the same partition-set, 
the “penalty” associated with each is the same. In addition, it must always be 
worse to falsify defaults in a certain partition-set than to falsify any number of 
defaults in lower sets. Thus the me-rank assigned to defaults in the partition-set 
$\Delta_i$, denoted $\mathrm{me}(\Delta_i)$, must be greater than the sum of the me-ranks of all defaults
in lower sets. The translation algorithm given in figure 5 accomplishes such an
assignment of me-ranks to defaults.

Note that the me-rank assignment in step [2] (a), is arbitrary to the extent 
that any integer greater than the sum of the me-ranks of all defaults in lower 
partition sets would suffice. Thus there is a whole class of me-rankings which are 
equivalent to a given lex-ordering. 
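
Step [2](a) and the dominance requirement it is meant to satisfy can be sketched as follows. The function names and the list-of-sizes representation are assumptions made for illustration; only the recurrence itself comes from figure 5.

```python
def partition_ranks(sizes):
    """me-rank shared by the defaults of each partition-set.

    sizes[k] is the number of defaults in the k-th partition-set.
    Implements the recurrence me(D_k) = (|D_{k-1}| + 1) * me(D_{k-1}),
    starting from me(D_0) = 1.
    """
    ranks = [1]
    for k in range(1, len(sizes)):
        ranks.append((sizes[k - 1] + 1) * ranks[k - 1])
    return ranks


def dominates_lower_sets(sizes, ranks):
    """Check that each rank exceeds the summed ranks of all defaults in lower sets."""
    return all(
        ranks[k] > sum(sizes[j] * ranks[j] for j in range(k))
        for k in range(len(sizes))
    )


sizes = [2, 2, 1]              # hypothetical partition-set sizes
ranks = partition_ranks(sizes)
print(ranks)                   # [1, 3, 9]
print(dominates_lower_sets(sizes, ranks))  # True
```

With this recurrence the summed rank of all lower defaults is always exactly one less than the rank of the next set, which is why any larger integer would serve equally well.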

Once the me-ranks have been assigned to rules it is a simple matter to calcu- 
late the corresponding strength assignment necessary to achieve this me-ranking: 
each default has a strength which is equivalent to the difference between the me- 
ranks of its minimal falsifying and verifying models. The strength of any default 
in the me-ranking found using the translation algorithm will be called the canon- 
ical me-strength of that default. Note that not only the defaults in the original 
set, but also any default which is lex-entailed (and hence me-entailed in the 
canonical me-ranking) will have an associated canonical me-strength^. 



^ In [3], the me-ranking is shown to be the unique solution to equations (1) and (2) 
if it satisfies a condition termed “robustness”. If the lex-ordering is robust then 
so is the canonical me-ranking which in turn implies that the canonical me-strength 
The following example shows the translation algorithm at work leading to a
canonical me-strength assignment which gives an identical rational consequence 
relation to that given by the lex-ordering. 

Example 4 (Bears).

\[ \Delta = \{ r_1 : b \Rightarrow d,\; r_2 : t \Rightarrow b,\; r_3 : t \Rightarrow \neg d,\; r_4 : b \Rightarrow h,\; r_5 : t \wedge l \Rightarrow d \} \]

(the intended interpretation of this knowledge base is that bears are dangerous,
teddies are bears, teddies are not dangerous, bears like honey, and teddies with
loose glass eyes are dangerous). The z-partition has three partition-sets:

\[ \Delta_0 = \{ b \Rightarrow d,\; b \Rightarrow h \} \qquad \Delta_1 = \{ t \Rightarrow \neg d,\; t \Rightarrow b \} \qquad \Delta_2 = \{ t \wedge l \Rightarrow d \} \]

Following the algorithm, set $\mathrm{me}(r_1) = \mathrm{me}(r_4) = 1$; then $\mathrm{me}(\Delta_1) = 3$, so $\mathrm{me}(r_2) =
\mathrm{me}(r_3) = 3$; finally $\mathrm{me}(\Delta_2) = 9$, so $\mathrm{me}(r_5) = 9$. This me-ranking is robust and
corresponds to a strength assignment of (1, 2, 2, 1, 7). The lex-ordering and
canonical me-ranking both induce the same rational consequence relation. Consider
the default “teddies which are dangerous and do not like honey are bears”. To
see whether this is entailed, it is necessary to examine the minimal verifying and
falsifying models of $t \wedge d \wedge \neg h \Rightarrow b$:

\[ \mathrm{lex}(t \wedge d \wedge \neg h \wedge b) = (1, 1, 0) \prec \mathrm{lex}(t \wedge d \wedge \neg h \wedge \neg b) = (0, 2, 0) \]

\[ \mathrm{me}_\Delta(t \wedge d \wedge \neg h \wedge b) = 4 < \mathrm{me}_\Delta(t \wedge d \wedge \neg h \wedge \neg b) = 6 \]

and so this default is both lex-entailed and canonically me-entailed. •
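
The figures in Example 4 can be reproduced by brute-force enumeration of the 32 models over the five atoms. The sketch below takes the z-partition and the me-ranks of the defaults as given in the example, applies equation (2) to rank models, and then derives strengths as in step [3] of figure 5. The encoding of defaults as pairs of Python predicates and all helper names are assumptions made for illustration.

```python
from itertools import product

ATOMS = ("b", "d", "t", "h", "l")

# Each default is (antecedent, consequent), both predicates over a model.
DEFAULTS = {
    "r1": (lambda m: m["b"], lambda m: m["d"]),
    "r2": (lambda m: m["t"], lambda m: m["b"]),
    "r3": (lambda m: m["t"], lambda m: not m["d"]),
    "r4": (lambda m: m["b"], lambda m: m["h"]),
    "r5": (lambda m: m["t"] and m["l"], lambda m: m["d"]),
}
PARTITION = [("r1", "r4"), ("r2", "r3"), ("r5",)]       # z-partition from the example
ME_RANK = {"r1": 1, "r4": 1, "r2": 3, "r3": 3, "r5": 9}  # ranks from the example

MODELS = [dict(zip(ATOMS, v)) for v in product((False, True), repeat=len(ATOMS))]


def falsified(m):
    """Names of the defaults whose antecedent holds and consequent fails in m."""
    return [n for n, (a, c) in DEFAULTS.items() if a(m) and not c(m)]


def me(m):
    """Equation (2): sum of the me-ranks of the defaults falsified by m."""
    return sum(ME_RANK[n] for n in falsified(m))


def lex(m):
    """Lex-tuple: number of falsified defaults per partition-set, lowest first."""
    f = falsified(m)
    return tuple(sum(n in f for n in part) for part in PARTITION)


def minimal_model(formula):
    """A model of the formula with the lowest me-rank."""
    return min((m for m in MODELS if formula(m)), key=me)


def strength(name):
    """Step [3]: me-rank of minimal falsifying model minus minimal verifying model."""
    a, c = DEFAULTS[name]
    return (me(minimal_model(lambda m: a(m) and not c(m)))
            - me(minimal_model(lambda m: a(m) and c(m))))


print([strength(n) for n in ("r1", "r2", "r3", "r4", "r5")])  # [1, 2, 2, 1, 7]

# The query default: t and d and not h => b.
v = minimal_model(lambda m: m["t"] and m["d"] and not m["h"] and m["b"])
f = minimal_model(lambda m: m["t"] and m["d"] and not m["h"] and not m["b"])
print(me(v), lex(v))  # 4 (1, 1, 0)
print(me(f), lex(f))  # 6 (0, 2, 0)
```

Enumeration is exponential in the number of atoms, so this is only suitable for checking small examples by hand.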

The translation algorithm finds a set of canonical me-strengths for any set 
of defaults that leads to an me-consequence relation which coincides with the 
lex-consequence relation. In fact there is an infinite class of such strength as- 
signments. The implication is that the lex-consequence relation is just a special 
case of the me-consequence relation. So what are the additional assumptions 
underlying the lex-ordering? 

The canonical me-strengths of defaults increase exponentially with the index 
of the partition set to which they belong. Effectively, the defaults in higher sets 
are deemed to hold more strongly under lexicographical entailment. Now, the 
z-ranking actually represents the exponent of qualitative probabilities, or the 
relative order of magnitude of models. The strength of a default, in contrast, 
represents an order of magnitude relation between sets of models. When the 
lex-ordering is translated into an me-ranking, the strength associated with each 
default is inversely connected with the probability of its minimal verifying model 
so that the strength of a default increases as the probability of it actually being 

assignment leads to a unique me-ranking. For non-robust lex-orderings, the canonical 
me-strength assignment might lead to multiple me-solutions. However, since the
canonical me-ranking is already arbitrary to some extent, this does not have a bearing 
on the analysis and can be safely ignored. Readers interested in robustness and 
multiple solutions are referred to [3]. 
verified decreases. The principles used by Lehmann to justify the lex-ordering
[7] bear no relation to this observation, however. Benferhat’s version of the lex- 
ordering [2] , which allows the user to specify the priorities explicitly, has a better 
justification since at least then the increase in strength can be viewed as the 
realisation of the default priorities which the user has chosen to impose. 

However, in both lex-systems, the canonical me-strengths which the trans- 
lation algorithm supplies do not directly correspond either to the partition sets 
or to the priorities the user assigns. This is because for two defaults which have 
the same priority, the lex-tuples of their minimal verifying and falsifying models 
may differ slightly leading to differences in their canonical me-ranks. In both sys- 
tems, the priorities only determine the order of magnitude of the canonical me- 
strengths which may vary slightly for defaults of the same priority. So although 
the lex-ordering allows the priorities to be specified, this cannot be achieved in 
isolation from the other defaults. In contrast, using the me-approach directly al- 
lows the user to specify the default priorities explicitly and independently. Thus, 
if the object of using a lexicographic ordering is to allow the knowledge engineer 
to make explicit his judgments about default priorities, it can be argued that us- 
ing maximum entropy and variable strengths is the fairest and most transparent 
way to achieve this. 



5 Behaviour of lexicographic entailment 

It is interesting to examine the behaviour of systems of default reasoning from 
the meta-level perspective. For example, it is well known that while System P 
maps a set of defaults into a nonmonotonic consequence relation, System P itself
is strictly monotonic on the addition of further defaults. This behaviour has been
termed “semi-monotonic” by Pearl [10] but, in fact, System P behaves classically
if defaults are given the appropriate semantics (e.g., let a default correspond to
the set of its admissible ranking functions).

The behaviour of systems when their consequences are learned, i.e., an en- 
tailed default is added to the set which entailed it, can be used to argue for the 
reasonableness of adopting such a system. It has been suggested [9] that sys- 
tems should satisfy rules like those of System P at the meta-level although how 
these should be interpreted is not always obvious. Lehmann himself pointed out 
that lex-entailment does not satisfy cautious monotonicity since adding entailed 
defaults may lead to the retraction of previous conclusions [7] . The following the- 
orems make clear why this occurs and what the implications are for the canonical 
me-strengths with which defaults are entailed. 

Theorem 1 shows that, provided a default is not entirely unexpected, i.e., its 
converse is not z-entailed, then the z-partition (and hence the z-rank of defaults) 
will not change radically on the addition of that default. In fact, a small ripple 
effect occurs with the new default being added to the appropriate partition set 
and defaults of equal or higher rank may or may not be ‘shunted up’ by one 
degree. 
Theorem 1 (Dynamics of z-partition). Consider a set of defaults, $\Delta$, with
z-partition $\Delta_0 \cup \ldots \cup \Delta_n$. Let $r$ be a default such that the z-rank, $k$, of its
minimal verifying model is not more than the z-rank of its minimal falsifying model
(equivalently, the converse of $r$ is not z-entailed by $\Delta$). Then (1) the z-partition
of $\Delta' = \{r\} \cup \Delta$ is such that $\Delta'_i = \Delta_i$ for $i < k$, (2) $r \in \Delta'_k$, and (3) for all
$r' \in \Delta_j$ with $j \ge k$, either $r' \in \Delta'_j$ or $r' \in \Delta'_{j+1}$.

Proof. All confirming models for the defaults in $\Delta_0 \cup \ldots \cup \Delta_{k-1}$ neither verify
nor falsify $r$ by the conditions of the theorem, hence the first $k$ partition-sets in
the new z-partition will be the same, that is, for $i < k$, $\Delta'_i = \Delta_i$, as required.

Now if $v_r$ is a minimum verifying model of $r$, it is also a confirming model
for $r$ wrt $\{r\} \cup \Delta_k \cup \ldots \cup \Delta_n$, since it may falsify defaults in $\Delta_{k-1}$ but not in
higher sets. Thus $r \in \Delta'_k$, as required.

Finally, consider $v_{r'}$, a verifying model for some default $r' \in \Delta_k$ which
previously confirmed $r'$ wrt $\Delta_k \cup \ldots \cup \Delta_n$. If $v_{r'}$ satisfies $r$ then it is also a
confirming model of $r'$ wrt $\{r\} \cup \Delta_k \cup \ldots \cup \Delta_n$, so $r' \in \Delta'_k$. Otherwise $r'$
does not tolerate $\{r\} \cup \Delta_k \cup \ldots \cup \Delta_n$. Therefore separate $\Delta_k$ into those
defaults which tolerate $\{r\} \cup \Delta_k \cup \ldots \cup \Delta_n$, say $\Delta_{T_k}$, and those which do not, say
$\Delta_{\neg T_k}$. Then $\Delta'_k = \{r\} \cup \Delta_{T_k}$ and it remains to partition $\Delta_{\neg T_k} \cup \Delta_{k+1} \cup \ldots \cup \Delta_n$.
Clearly all defaults in $\Delta_{\neg T_k}$ tolerate $\Delta_{\neg T_k} \cup \Delta_{k+1} \cup \ldots \cup \Delta_n$ since they did
previously and so $\Delta_{\neg T_k} \subseteq \Delta'_{k+1}$. Separate $\Delta_{k+1}$ into those defaults which tolerate
$\Delta_{\neg T_k} \cup \Delta_{k+1} \cup \ldots \cup \Delta_n$, say $\Delta_{T_{k+1}}$, and those which do not, say $\Delta_{\neg T_{k+1}}$. Then
$\Delta'_{k+1} = \Delta_{\neg T_k} \cup \Delta_{T_{k+1}}$ and it remains to partition $\Delta_{\neg T_{k+1}} \cup \Delta_{k+2} \cup \ldots \cup \Delta_n$.
Proceeding in this way, the z-partition of $\Delta'$ is formed such that for any default
$r' \in \Delta_j$ with $j \ge k$, it holds that either $r' \in \Delta'_j$ or $r' \in \Delta'_{j+1}$, as required. •

Theorem 2 shows that if a default is entirely expected, i.e., is z-entailed, then 
no z-ranks change. This demonstrates why System Z can be called the rational 
closure of the set, since the addition of a z-entailed default will not lead to any
new z-conclusions; there will undoubtedly be further lex-conclusions, however. 

Theorem 2. Given the conditions of theorem 1, if the z-rank, $k$, of the minimal
verifying model of $r$ is strictly less than the z-rank of its minimal falsifying model
according to $\Delta$ (equivalently, $r$ is z-entailed by $\Delta$), then $\Delta'_k = \{r\} \cup \Delta_k$ and
$\Delta'_i = \Delta_i$ for $i \ne k$.

Proof. Since $r$ is z-entailed by $\Delta$, all confirming models of defaults in $\Delta_k$ have
z-rank $k$ and therefore cannot be falsifying models of $r$. Hence all defaults in $\Delta_k$
tolerate $\{r\} \cup \Delta_k \cup \ldots \cup \Delta_n$, and $\Delta'_k = \{r\} \cup \Delta_k$. All other partition-sets remain
unchanged. •

Finally, theorem 3 demonstrates that adding a default to a set which lex- 
entailed it, leads to the default obtaining a higher canonical me-strength. Clearly, 
this is to be expected since when the lex-entailed default is learnt, violating it 
takes on more significance. 

Theorem 3. If using $\mathrm{me}_\Delta$, the canonical me-ranking for $\Delta$, $r$ is me-entailed
with strength $s$, then using $\mathrm{me}_{\Delta'}$, the canonical me-ranking for $\Delta' = \{r\} \cup \Delta$, $r$
is me-entailed with strength $s' > s$.
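Proof. Let the z-partition of $\Delta$ be $\Delta_0 \cup \ldots \cup \Delta_n$ and the lex-equivalent me-ranks
associated with each partition set be $\mathrm{me}(\Delta_0), \ldots, \mathrm{me}(\Delta_n)$. Let the minimal
verifying and falsifying models for $r$ in the me-ranking be $v_r$ and $f_r$, respectively.
Then $s = \mathrm{me}_\Delta(f_r) - \mathrm{me}_\Delta(v_r)$.

First suppose that $r$ is z-entailed by $\Delta$ so that if $z(v_r) = k$ then $z(f_r) > k$.
Then by theorem 2 the z-partition of $\Delta'$ has $\Delta'_k = \{r\} \cup \Delta_k$ and $\Delta'_i = \Delta_i$
for $i \ne k$. Hence $\mathrm{me}(\Delta'_i) = \mathrm{me}(\Delta_i)$ for $i \le k$ and $\mathrm{me}(\Delta'_j) > \mathrm{me}(\Delta_j)$ for $j > k$.
Now, $\mathrm{me}_{\Delta'}(v_r) = \mathrm{me}_\Delta(v_r)$ since $v_r$ only falsifies defaults in partition-sets $\Delta_0$
to $\Delta_{k-1}$. However, $f_r$ now falsifies an extra default, $r$ itself, and so its me-rank
must be higher by at least $\mathrm{me}(\Delta_k)$. Hence $s' = \mathrm{me}_{\Delta'}(f_r) - \mathrm{me}_{\Delta'}(v_r) \ge
\mathrm{me}_\Delta(f_r) + \mathrm{me}(\Delta_k) - \mathrm{me}_\Delta(v_r) > s$, as required.

Now suppose that $r$ is only lex-entailed so that $z(v_r) = z(f_r) = k$. Then the
z-partition of $\Delta'$ is as described in theorem 1 so that $r \in \Delta'_k$. Now $\mathrm{me}(\Delta'_i) = \mathrm{me}(\Delta_i)$
for $i \le k$. Again $\mathrm{me}_{\Delta'}(v_r) = \mathrm{me}_\Delta(v_r)$. However, since $z(f_r) = k$ it follows that
$\mathrm{me}_\Delta(f_r) < \mathrm{me}(\Delta_k)$ but $\mathrm{me}_{\Delta'}(f_r) \ge \mathrm{me}(\Delta_k)$. Hence $s' = \mathrm{me}_{\Delta'}(f_r) - \mathrm{me}_{\Delta'}(v_r) \ge
\mathrm{me}(\Delta_k) - \mathrm{me}_\Delta(v_r) > \mathrm{me}_\Delta(f_r) - \mathrm{me}_\Delta(v_r) = s$, as required. •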
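
Theorem 3 can be illustrated on the Bears knowledge base of Example 4 by adding the lex-entailed default "teddies which are dangerous and do not like honey are bears" and recomputing everything. The sketch below builds the z-partition with the usual tolerance test (a default is tolerated by a set if some model verifies it without falsifying any member of the set), which is the notion of toleration used in the proofs above. The brute-force model enumeration and all function names are assumptions made for illustration.

```python
from itertools import product

ATOMS = ("b", "d", "t", "h", "l")
MODELS = [dict(zip(ATOMS, v)) for v in product((False, True), repeat=len(ATOMS))]


def verifies(default, m):
    antecedent, consequent = default
    return antecedent(m) and consequent(m)


def falsifies(default, m):
    antecedent, consequent = default
    return antecedent(m) and not consequent(m)


def z_partition(defaults):
    """Repeatedly peel off the defaults tolerated by all remaining defaults."""
    remaining, partition = dict(defaults), []
    while remaining:
        tolerated = [
            name for name, d in remaining.items()
            if any(verifies(d, m)
                   and not any(falsifies(e, m) for e in remaining.values())
                   for m in MODELS)
        ]
        if not tolerated:
            raise ValueError("no z-partition: the default set is inconsistent")
        partition.append(tolerated)
        for name in tolerated:
            del remaining[name]
    return partition


def canonical_ranks(partition):
    """Steps [1] and [2] of the translation algorithm."""
    ranks, rank = {}, 1
    for k, part in enumerate(partition):
        if k > 0:
            rank *= len(partition[k - 1]) + 1
        for name in part:
            ranks[name] = rank
    return ranks


def canonical_strength(defaults, target):
    """Strength with which `target` holds in the canonical me-ranking of `defaults`."""
    ranks = canonical_ranks(z_partition(defaults))

    def me(m):  # equation (2)
        return sum(ranks[n] for n, d in defaults.items() if falsifies(d, m))

    lowest_verifying = min(me(m) for m in MODELS if verifies(target, m))
    lowest_falsifying = min(me(m) for m in MODELS if falsifies(target, m))
    return lowest_falsifying - lowest_verifying


BEARS = {
    "r1": (lambda m: m["b"], lambda m: m["d"]),
    "r2": (lambda m: m["t"], lambda m: m["b"]),
    "r3": (lambda m: m["t"], lambda m: not m["d"]),
    "r4": (lambda m: m["b"], lambda m: m["h"]),
    "r5": (lambda m: m["t"] and m["l"], lambda m: m["d"]),
}
R = (lambda m: m["t"] and m["d"] and not m["h"], lambda m: m["b"])
EXTENDED = {**BEARS, "r": R}

s = canonical_strength(BEARS, R)
s_new = canonical_strength(EXTENDED, R)
print(z_partition(EXTENDED))  # [['r1', 'r4'], ['r2', 'r3'], ['r5', 'r']]
print(s, s_new)               # 2 11
```

In this instance the new default joins the highest partition-set, so falsifying it now costs its own me-rank of 9 on top of the previous penalty, and its strength rises from 2 to 11.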

Theorem 3 shows that adding a default to a set which lex-entailed it leads to it 
obtaining a higher canonical me-strength than that with which it was previously 
me-entailed. This would seem to be an explanation of the fact that lex-entailment 
fails to satisfy cautious monotonicity. Syntactically, theorem 1 confirms this since 
the addition of a lex-entailed default may lead to a revised z-partition which no 
longer lex-entails old conclusions. However, one could argue that, according to 
the semantic interpretation of lex-entailment as a form of me-entailment, it is 
not possible to add a lex-entailed default to a set without changing its semantics, 
i.e., its canonical me-strength. In a sense, this argument implies that cautious 
monotonicity is simply not applicable to lex-entailment since the semantics of a 
default cannot be specified independently of its surrounding defaults. 

The behaviour of me-entailment on the addition of me-entailed defaults is 
interesting. It depends critically on the strength assigned to the given default 
compared with the degree to which it is me-entailed^. If it is assigned a lower 
strength then no admissible me-ranking exists, whilst if it is assigned a higher 
strength a revised unique me-ranking is produced. If the added default is assigned 
a strength equal to the degree to which it was previously entailed, it is usually the 
case that there are multiple solutions for the me-ranking. An me-ranking with 
the added default taking zero me-rank is one solution — one could say in this case 
that the default is redundant — but there may be other solutions in which it is not 
the added default which is redundant but one of the originals. A more detailed 
account of these findings may be found in [3]. Thus it is possible for the addition
of the default to lead to the same me-ranking, that is, me-entailment does satisfy
cautious monotonicity; however, one must be careful since this solution may not
be unique.



^ That is, the difference between the me-ranks of its minimal falsifying and verifying
models. 
6 Conclusion

This paper has compared lexicographic entailment with maximum entropy en- 
tailment and found the former to be a special case of the latter. It has been argued 
that the me-approach is better justified since it is based on a well-understood 
principle of indifference [5], and that it is a better method for representing judg- 
ments about the relative priorities between defaults because these can be made 
explicitly and independently. The behaviour of both systems was also examined 
to show why lexicographic entailment fails to satisfy the meta-rule of cautious 
monotonicity and how maximum entropy entailment does satisfy it under certain 
conditions and with certain caveats. 



Acknowledgements 

This work was partly funded by the EPSRC under grant GR/L84117. The first 
author was supported by an EPSRC studentship. The authors would like to 
thank two anonymous referees for their comments on an earlier draft of this 
paper. 



References 

1. E. Adams. The Logic of Conditionals. Reidel, Dordrecht, Netherlands, 1975. 

2. S. Benferhat, C. Cayrol, D. Dubois, J. Lang, and H. Prade. Inconsistency management
and prioritized syntax-based entailment. In R. Bajcsy, editor, Proceedings of
the International Joint Conference on Artificial Intelligence, pages 640-645. Morgan
Kaufmann, 1993.

3. R. A. Bourne and S. Parsons. Maximum entropy and variable strength defaults. 
In Proceedings of the Sixteenth International Joint Conference on Artificial Intel- 
ligence, 1999. 

4. M. Goldszmidt, P. Morris, and J. Pearl. A maximum entropy approach to non- 
monotonic reasoning. IEEE Transactions on Pattern Analysis and Machine Intel- 
ligence, 15:220-232, 1993. 

5. E. Jaynes. Where do we stand on maximum entropy? In R. Levine and M. Tribus,
editors, The Maximum Entropy Formalism, pages 15-118, Cambridge, MA, 1979.
MIT Press.

6. S. Kraus, D. Lehmann, and M. Magidor. Nonmonotonic reasoning, preferential 
models and cumulative logics. Artificial Intelligence, 44:167-207, 1990. 

7. D. Lehmann. Another perspective on default reasoning. Annals of Mathematics 
and Artificial Intelligence, 15:61-82, 1995. 

8. D. Lehmann and M. Magidor. What does a conditional knowledge base entail? 
Artificial Intelligence, 55:1-60, 1992. 

9. D. Makinson. General theory of cumulative inference. In M. Reinfrank, J. de Kleer,
M. L. Ginsberg, and E. Sandewall, editors, Lecture Notes in Artificial Intelligence
346, pages 1-18, Berlin, 1988. Springer.

10. J. Pearl. System Z: a natural ordering of defaults with tractable applications to 
default reasoning. In Proceedings of the 3rd Conference on Theoretical Aspects of 
Reasoning about Knowledge, pages 121-135, 1990. 

11. R. Reiter. A logic for default reasoning. Artificial Intelligence, 13:81-132, 1980. 




Avoiding Non-Ground Variables 



Stefan Brüning¹ and Torsten Schaub²

¹ TLC GmbH, Hahnstraße 43a, D-60528 Frankfurt, Stefan.Bruening@tlc.de
² Institut für Informatik, Universität Potsdam, Postfach 60 15 53, D-14415 Potsdam,
torsten@cs.uni-potsdam.de



Abstract For many reasoning tasks in Artificial Intelligence, it is much sim- 
pler (or even essential) to deal with ground inferences rather than with infer- 
ences comprising variables. The usual approach to guarantee ground inferences 
is to introduce means for enumerating the underlying Herbrand-universe so that 
during subsequent inferences variables become bound in turn to the respective 
Herbrand-terms. The inherent problem with such an approach is that it may cause 
a tremendous number of unnecessary backtracking steps due to heaps of incorrect 
variable instantiations. In this paper, we propose a new concept that refrains from 
backtracking by appeal to novel inference rules that allow for correcting previous 
variable bindings. We show that our approach is not only beneficial for classical
proof systems but it is also well-suited for tasks in knowledge representation and
reasoning. The major contribution of this paper lies actually in an application of
our approach to a calculus conceived for reasoning with default logic.



1 Introduction 

Automated theorem proving technology plays a traditionally important role for knowl- 
edge representation and reasoning. This is because knowledge-based systems need pow- 
erful inference engines for exploiting their underlying knowledge-bases. As opposed to 
mathematical problem sets, however, one encounters differently structured problem
sets when dealing with knowledge-bases. In fact, most applications remain within the
framework of Datalog [4] by relying on the absence of function symbols while presuming
a large yet finite set of constant symbols, in the tradition of classical database
systems. In the theoretical literature one usually appeals to the set of ground instances 
of a given knowledge base. Even though this is legitimate from a theoretical point of 
view, it almost always turns out to be infeasible from a practical point of view. 

A related problem is found in (particular areas of) automated theorem proving, 
where one is concerned with decision procedures for fragments of first-order logic
like function-free clause sets (cf. [5]). There, the main problem is to avoid infinite 
derivations which is achieved by (partially) instantiating non-bound variables. A simi- 
lar technique is also found in the area of deductive databases, where one is concerned 
with so-called safe Datalog rules in order to guarantee finite intensional relations [4]. 
The idea underlying this approach is to introduce means for enumerating the underly- 
ing Herbrand-universe, so that in the subsequent inferences variables get bound in turn 
to Herbrand-terms. The usual problem with such an approach is that it may cause a 
tremendous number of unnecessary backtracking steps due to heaps of incorrect vari- 
able instantiations. 
In view of the last two problems, we propose an approach that refrains from backtracking
by appeal to novel inference rules that allow for correcting previous variable
bindings. This does not only allow us to decide function-free clause sets (a non-trivial
problem with standard calculi*) but it moreover turns out to be especially well-suited
for tasks in knowledge representation and reasoning. We illustrate this by showing, af- 
ter the presentation of our basic calculus, how it can be applied to an inference problem 
in non-monotonic reasoning, namely query-answering in default logic [12]. 

In what follows, let us make the aforementioned problems more precise: In fact, we 
have chosen model-elimination (ME; [8]) as our basic calculus. Apart from the perfor- 
mance of its resulting theorem provers, ME is closely related to the inference systems 
used for logic programming and deductive databases. And moreover it has become a 
rather popular approach for addressing diverse problems in knowledge representation 
and reasoning, e.g. [11,1,13]. The basic inference steps of ME-based calculi are called 
extension and reduction step. Intuitively, an extension step amounts to Prolog’s use 
of input resolution: A subgoal $\neg L$ is resolved with an input clause $\{L, K_1, \ldots, K_n\}$
resulting in the new subgoals $K_1, \ldots, K_n$. The reduction step renders the inference

system complete for (full) propositional clause logic: A subgoal is solved if it is com- 
plementary to one of its ancestor subgoals. One of the most important (and practically 
indispensable) refinements for ME-based calculi is expressed by the regularity condi- 
tion (e.g. see [6, 7]). It allows for discarding subgoals identical to one of their ancestor 
subgoals. 

For making the aforementioned problem more precise, consider the (negated) query
$\neg p(b, e)$ along with clause set

\[ \{ \{p(X,Z), \neg p(X,Y), \neg p(Y,Z)\}, \{p(a,b)\}, \{p(b,c)\}, \{p(c,d)\}, \{p(d,e)\} \} \tag{1} \]

whose first clause encodes the transitivity of predicate p. (Note that deciding the unsat- 
isfiability of such a clause set is a rather difficult problem for pure ME- or resolution 
calculi. For instance, none of the resolution calculi given in [5] allows to decide such 
clause sets.) If every goal during a ME-derivation has to be ground (in this case, a 
ME-calculus employing the regularity restriction is guaranteed to terminate), one could 
follow the aforementioned classical approach which requires to introduce a new pred- 
icate g in order to enumerate possible instantiations for variable Y in the transitivity 
clause. The resulting clause set looks as follows: 

\[ \{ \{p(X,Z), \neg g(Y), \neg p(X,Y), \neg p(Y,Z)\}, \{p(a,b)\}, \{p(b,c)\}, \{p(c,d)\}, \{p(d,e)\}, \{g(a)\}, \{g(b)\}, \{g(c)\}, \{g(d)\}, \{g(e)\} \} \]

Further, during a ME-derivation one has to ensure that, after using the transitivity clause
as input clause for an extension step, the next derivation step is applied to the new
subgoal $\neg g(Y)$. The drawback of this solution is obvious: Whenever a wrong instantiation
is chosen when solving subgoal $\neg g(Y)$, one has to backtrack the derivation up to
subgoal $\neg g(Y)$. For instance, consider the case where the first subproof of $\neg g(Y)$ results in
substitution $\{Y \backslash a\}$. This wrong guess makes it impossible to apply the clause $\{p(b, c)\}$
to subgoal $\neg p(b, Y)$. Even worse, if the transitivity clause is applied to $\neg p(b, Y)$, no

* At least if derivations are allowed to contain non-bound variables. 
(sub)refutation can be found. Hence, it is necessary to backtrack the derivation up to
the subproof of $\neg g(Y)$.
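
To see how often a blind guess for Y goes wrong in this example, one can compute which constants admit a refutation at all. The sketch below does this with a naive bottom-up fixpoint over the facts of clause set (1). Note that this is a swapped-in technique used only to inspect the example, not the ME-calculus discussed in the paper, and the helper names are hypothetical.

```python
FACTS = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
CONSTANTS = sorted({c for pair in FACTS for c in pair})


def transitive_closure(facts):
    """All pairs (x, z) such that p(x, z) follows from the facts by transitivity."""
    closure = set(facts)
    while True:
        new = {(x, z)
               for (x, y1) in closure
               for (y2, z) in closure
               if y1 == y2} - closure
        if not new:
            return closure
        closure |= new


P = transitive_closure(FACTS)


def good_guesses(x, z):
    """Constants Y for which both p(x, Y) and p(Y, z) are derivable."""
    return [y for y in CONSTANTS if (x, y) in P and (y, z) in P]


print(good_guesses("b", "e"))  # ['c', 'd']
print(("b", "e") in P)         # True
print(("b", "a") in P)         # False
```

For the query on p(b, e), only two of the five constants lead anywhere, so an enumeration that starts with a, b or e must be undone by backtracking in the classical approach.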

The technique proposed in this paper overcomes this problem. This is achieved
by treating those substitutions that are only required to instantiate variables in open
goals as assumptions that can be dropped or corrected during the subsequent derivation.
These assumed substitutions are collected in a supplementary substitution, called $\rho$ in
the sequel, and modified by two new derivation steps, called instantiation and correction
step. This refined handling of substitutions totally avoids backtracking caused by wrong
instantiations.

For illustration reconsider the original clause set (1) along with (negated) query
$\neg p(b, d)$. At the beginning of a derivation, $\rho$ is set to the empty substitution. An
extension step with the transitivity clause results in two subgoals containing variable $Y$,
namely $\neg p(b, Y)$ and $\neg p(Y, d)$. In our approach, the existence of non-ground goals
requires the application of an instantiation step which generates a grounding substitution for $Y$, say
$\{Y \backslash a\}$, which is added to the supplementary substitution $\rho$. The main difference to the
classical approach is the treatment of $\rho$: $\rho$ is not considered in the following derivation
steps; it is only used for purposes requiring ground derivations. For our example, this
means that after the application of the instantiation step, it is possible to apply an
extension step to $\neg p(b, Y)$ with input clause $\{p(b, c)\}$. Hence the assumption $\{Y \backslash a\}$ does
not prevent a later derivation step that instantiates $Y$ with a different term (note that this
would require backtracking in the standard approach). On the other hand, if extension
steps with the transitivity clause as input clause are applied to $\neg p(b, Y)$ (and the resulting
subgoals), $\rho$ is usable for detecting non-regular tableaux: Each derivation generating
a tableau $T$ such that $T\rho$ violates regularity can be pruned if $\rho$ cannot be corrected (by
a correction step) in such a way that $T\rho$ becomes regular. Since each inner literal of
$T\rho$ is ground^, such a procedure guarantees that no infinite derivation can be generated.
Hence, it is possible to detect that, for example, $\neg p(b, a)$ is not a logical consequence of
clause set (1). Note that this is impossible if the ordinary regularity criterion is applied.
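
The role of the supplementary substitution in pruning can be pictured with a toy check. The sketch below is a strong simplification and not the correction step defined by the authors: it represents a branch as a list of literals, applies an assumed substitution `rho`, and looks for a rebinding of one variable that restores regularity. The tuple encoding of literals and the functions `is_regular` and `correct` are hypothetical.

```python
CONSTANTS = ("a", "b", "c", "d", "e")


def apply(subst, literal):
    """Apply a substitution (dict from variable to constant) to a literal."""
    positive, pred, args = literal
    return (positive, pred, tuple(subst.get(a, a) for a in args))


def is_regular(branch, rho):
    """True if no two literals on the branch coincide once rho is applied."""
    grounded = [apply(rho, lit) for lit in branch]
    return len(set(grounded)) == len(grounded)


def correct(branch, rho, variable, constants=CONSTANTS):
    """Return a copy of rho with `variable` rebound so the branch is regular, or None."""
    for c in constants:
        candidate = {**rho, variable: c}
        if is_regular(branch, candidate):
            return candidate
    return None  # no correction exists, so the derivation can be pruned


# Branch obtained by applying the transitivity clause twice to the query on p(b, d).
branch = [(False, "p", ("b", "d")),
          (False, "p", ("b", "Y1")),
          (False, "p", ("b", "Y2"))]
rho = {"Y1": "a", "Y2": "a"}

print(is_regular(branch, rho))     # False
print(correct(branch, rho, "Y2"))  # {'Y1': 'a', 'Y2': 'b'}

# Once every constant already occurs on the branch, nothing is left to try.
full = [(False, "p", ("b", c)) for c in CONSTANTS] + [(False, "p", ("b", "Y"))]
print(correct(full, {"Y": "a"}, "Y"))  # None
```

Because the set of constants is finite, such a branch can only grow until every ground instance has been used, which is the intuition behind the termination claim.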

The above techniques provide a basic automated reasoning framework suited for
knowledge representation purposes, in particular in view of Datalog-like languages.
Our ultimate goal is however the adaption of this framework to such purposes rather
than the framework as such. The major contribution of this paper lies thus in Section 3,
where we give an application of this approach to a calculus conceived for reasoning with
default logic.



2 The basic approach 

This section provides the basic inference steps which can be seen as a starting point 
for the definition of ME-calculi that integrate a refined handling of variables. Among
the diverse mouldings of ME-based calculi, we consider those variants that rely on so- 
called ME-tableaux as basic proof objects (cf. [6, 7]). 

^ Note that the purpose of an instantiation step is to generate an assumption for one open goal 
rather than all open goals of a tableau T. Hence it is only guaranteed that the inner literals of 
T are ground. 
Our approach adapts the classical ME-inference steps in the following way: First,
we associate each tableau $T$ with a substitution $\rho_T$ gathering ground substitutions for
variables occurring in $T$. Second, we restrict the application of extension and reduction
steps to ground goals in $T\rho_T$. Now, we introduce two new derivation steps, called
instantiation and correction step: Given a tableau $T$, the purpose of the former step is
to extend $\rho_T$ such that an open (non-ground) goal in $T\rho_T$ becomes ground. The latter
can be used to correct the decision made by an instantiation step; this guarantees that it
is never necessary to take instantiation steps into account during backtracking.

We deal with a language whose terms consist of a finite non-empty set of nullary
function symbols, called constants. Formulas over this language are referred to as
function free. Let $c(C)$ be the set of constant symbols occurring in clause set $C$ if this
set is non-empty. Otherwise, we set $c(C) = \{a\}$ for some constant $a$. $rg(\sigma)$ and $dom(\sigma)$ denote the range
and the domain of a substitution $\sigma$, respectively. For a literal $L$, $var(L)$ denotes the set
of variables occurring in $L$. We let $L^c$ denote the literal complementary to $L$.

Formally, a tableau is a pair $(t, \lambda)$ consisting of an ordered tree $t$ and a labeling
function $\lambda$ assigning literals to the non-root nodes in $t$. A branch of a tableau $T$ is a
sequence $(o_1, \ldots, o_n)$ of nodes in $T$ such that $o_1$ is the root of $T$, $o_i$ is the immediate
predecessor of $o_{i+1}$ for $1 \le i < n$, and $o_n$ is a leaf of $T$. We sometimes denote a
branch $(o_1, \ldots, o_n)$ by a sequence containing the labels of its nodes, that is, we write
$(\lambda(o_2), \ldots, \lambda(o_n))$ (note that the root node of a tableau is never labeled with a literal).
A branch is complementary if the labels of its nodes $o_1, \ldots, o_n$ contain some literal $L$
and its complement $L^c$. In order to distinguish the simple presence of a complementary
branch from the detection of this fact, we allow branches to be labeled as closed. Each branch
which is labeled as closed must be complementary. A branch which is not marked as
closed is called an open branch. A tableau is closed if each of its branches is closed,
otherwise it is open.
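The tableau notions above can be pictured with a small data structure. The following is only a minimal sketch under our own assumptions: the names `Node`, `branches`, `is_complementary` and `is_closed_tableau`, and the encoding of a literal as a `(positive, predicate, arguments)` tuple, are illustrative and do not stem from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed encoding: ¬p(b, Y) is (False, "p", ("b", "Y")).
Literal = Tuple[bool, str, Tuple[str, ...]]


def complement(lit: Literal) -> Literal:
    """Return the complementary literal (same atom, opposite sign)."""
    positive, pred, args = lit
    return (not positive, pred, args)


@dataclass
class Node:
    label: Optional[Literal] = None          # None only for the unlabeled root
    children: List["Node"] = field(default_factory=list)
    closed: bool = False                     # meaningful for leaves only


def branches(node: Node, prefix=()):
    """Yield every root-to-leaf branch as a tuple of nodes."""
    path = prefix + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from branches(child, path)


def is_complementary(branch) -> bool:
    """A branch is complementary if it contains some literal and its complement."""
    labels = {n.label for n in branch if n.label is not None}
    return any(complement(lit) in labels for lit in labels)


def is_closed_tableau(root: Node) -> bool:
    """A tableau is closed if each of its branches is marked as closed."""
    return all(b[-1].closed for b in branches(root))


# Goal ¬p(b, e) extended by two successors: p(b, e) (closed) and ¬p(b, Y) (open).
goal = Node((False, "p", ("b", "e")), children=[
    Node((True, "p", ("b", "e")), closed=True),
    Node((False, "p", ("b", "Y"))),
])
root = Node(children=[goal])
all_branches = list(branches(root))
```

In this sketch the `closed` flag is kept separate from `is_complementary`, mirroring the distinction between the presence of a complementary branch and its detection.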

Throughout the following definitions, let $T = (t, \lambda)$ be an arbitrary yet fixed tableau
of some set of input clauses $C$, and let $\rho_T$ be a ground substitution with $rg(\rho_T) \subseteq c(C)$.
The purpose of $\rho_T$ is to accumulate the assumed substitutions needed for grounding
tableau $T$:

Definition 1. $(T', \rho_{T'})$ is obtained by an initialization step as follows. Let $o$ be the root
of a one-node tree. Select in $C$ a negative $\omega$-clause $\{L_1, \ldots, L_n\} \in C$. Then, attach $n$
new successor nodes to $o$, and label them in turn with $L_1, \ldots, L_n$. Finally, $\rho_{T'}$ is set to
the empty substitution.

The definition of extension and reduction steps is identical to the one in classical
ME-calculi except that they can only be applied to a tableau $T$ if the selected open goal
in $T\rho_T$ is ground. The application of an extension or a reduction step to a tableau $T$
does not affect its associated substitution $\rho_T$.

Definition 2. $(T', \rho_{T'})$ is obtained from $(T, \rho_T)$ by an extension step as follows. Select
in $t$ a leaf node $o$ of an open branch labeled with literal $L$ where $var(L\rho_T) = \emptyset$. Let
$\{L_1, \ldots, L_n\}$ be a new instance of some clause in $C$ such that $L^c\sigma = L_i\sigma$ for some
$i \in \{1, \ldots, n\}$ and some mgu³ $\sigma$. Then, attach $n$ new successor nodes $o_1, \ldots, o_n$ to $o$,
label them in turn with $L_1, \ldots, L_n$, respectively, and apply $\sigma$ to the resulting tableau.
The new branch with leaf node $o_i$ is marked as closed. Finally, set $\rho_{T'} = \rho_T$.

³ mgu stands for most general unifier.

Definition 3. $(T', \rho_{T'})$ is obtained from $(T, \rho_T)$ by a reduction step as follows. Select
in $t$ the leaf node $o_k$ of an open branch $b = (o_1, \ldots, o_k)$ where $o_k$ is labeled with literal
$L$ and $var(L\rho_T) = \emptyset$. If there is an ancestor node $o_i$ on $b$ labeled with literal $K$ such
that $L^c\sigma = K\sigma$ for some mgu $\sigma$, then apply $\sigma$ to the tableau, mark $b$ as closed, and set
$\rho_{T'} = \rho_T$.

As long as every open goal is ground during a derivation, the above inference steps
constitute a complete calculus (for function-free clause sets). However, in case open
goals are allowed to contain variables, we must provide a possibility to extend $\rho_T$ in
order to enable the application of extension and reduction steps. To this end, we introduce
the following inference step.

Definition 4. $(T', \rho_{T'})$ is obtained from $(T, \rho_T)$ by an instantiation step as follows.
Select in $t$ the leaf node $o_k$ of an open branch $b = (o_1, \ldots, o_k)$ where $o_k$ is labeled with
literal $L$ and $var(L\rho_T) \neq \emptyset$. Let $\rho$ be a ground substitution such that $var(L\rho_T\rho) = \emptyset$.
Set $\rho_{T'} = \rho_T\rho$ and $T' = T$.

One can show that the calculus comprising the above inference steps is sound and
complete for function-free, first-order clause sets. In particular, a theorem prover using
such a calculus need not consider instantiation steps for backtracking since the collected
assumptions do not influence the application of extension and reduction steps.
In case these assumptions are exploited, for instance to implement some calculus
refinements, like regularity, the situation changes. Then, in order to avoid backtracking of
instantiation steps one has to provide an additional derivation step which allows
correcting the substitutions generated by previous instantiation steps “on the fly”. Notably,
since the application of such a correction step can be further corrected by another
application of a correction step, such steps need not be considered during backtracking
either.

Definition 5. $(T', \rho_{T'})$ is obtained from $(T, \rho_T)$ by a correction step as follows. Select
a ground substitution $\rho$ such that $\rho \neq \rho_T$. Finally, set $\rho_{T'} = \rho$ and $T' = T$.
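Instantiation and correction steps only touch the assumption $\rho_T$, never the tableau itself. The sketch below illustrates this under our own assumptions: substitutions are Python dicts, variables are the terms starting with an upper-case letter, the instantiation step simply picks the first available constant, and correction candidates are restricted to the domain of the current assumption. None of these choices is prescribed by Definitions 4 and 5.

```python
from itertools import product


def is_var(term: str) -> bool:
    """Assumed convention: variables start with an upper-case letter."""
    return term[:1].isupper()


def substitute(lit, rho):
    """Apply a substitution (dict from variables to terms) to a literal."""
    positive, pred, args = lit
    return (positive, pred, tuple(rho.get(a, a) for a in args))


def variables(lit):
    return {a for a in lit[2] if is_var(a)}


def instantiation_step(goal, rho, constants):
    """Extend rho so that the selected open goal becomes ground under it."""
    free = sorted(variables(substitute(goal, rho)))
    if not free:
        raise ValueError("goal is already ground under rho")
    extended = dict(rho)
    for v in free:
        extended[v] = constants[0]        # any constant of c(C) is admissible
    return extended


def correction_candidates(rho, constants):
    """Enumerate ground substitutions over dom(rho) that differ from rho."""
    names = sorted(rho)
    for values in product(constants, repeat=len(names)):
        candidate = dict(zip(names, values))
        if candidate != rho:
            yield candidate


constants = ["a", "b", "c"]
goal = (False, "p", ("b", "Y"))           # the open goal ¬p(b, Y)
rho = instantiation_step(goal, {}, constants)
alternatives = list(correction_candidates(rho, constants))
```

Because a wrong guess can always be replaced by one of the `alternatives` later on, the prover never has to undo the instantiation step itself.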

Clearly, the application of a correction step should only be taken into consideration
if the tableau under consideration meets some special criterion. For instance, if one is
interested in a proof system which is able to decide the satisfiability of function-free
clause sets, one should restrict the application of correction steps to those tableaux $T$
where $T\rho_T$ violates regularity.

In fact, regularity provides a highly efficient and thus practically indispensable
means for restricting the search space in ME-based theorem proving. Intuitively, it
allows for discarding subgoals identical to one of their ancestor subgoals. Formally, it
forbids generating tableaux containing two nodes, $o_1$ and $o_2$ say, on the same branch
such that $o_1$ and $o_2$ are labeled with the same literal. Using this refinement, the
number of possible tableaux to be built during a deduction decreases considerably in many
cases (e.g. see [6, 7]).
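As a minimal sketch of the regularity test, assume that a branch is given as a list of literal labels (here plain strings, an encoding chosen only for illustration) and that literals are compared syntactically:

```python
def is_regular(branch):
    """True iff no literal labels two different nodes of the branch."""
    return len(branch) == len(set(branch))


def tableau_is_regular(all_branches):
    """A tableau is regular iff each of its branches is regular."""
    return all(is_regular(b) for b in all_branches)


ok = ["¬p(b,e)", "¬p(b,c)", "¬p(c,e)"]
loop = ["¬p(b,e)", "¬p(b,a)", "¬p(b,a)"]   # subgoal identical to an ancestor
```

A prover would discard the tableau containing `loop` as soon as the repeated subgoal is generated.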

For an illustration of the derivation steps defined in this section, recall the example
given in Section 1 consisting of the clause set $S = \{\{p(X, Z), \neg p(X, Y), \neg p(Y, Z)\},
\{p(a, b)\}, \{p(b, c)\}, \{p(c, d)\}, \{p(d, e)\}, \{\neg p(b, e)\}\}$ where the last clause represents
the (negated) query $p(b, e)$.

1. A derivation starts with an initialization step generating a tableau $T$ with open goal
$\neg p(b, e)$ and the empty set of assumptions $\rho_T$.

2. The sole possibility to extend $T$ is to apply an extension step with the transitivity
clause (with $\sigma = \{X\backslash b, Z\backslash e\}$) generating two new open goals, namely $\neg p(b, Y)$
and $\neg p(Y, e)$. Note that $\rho_T$ remains unchanged.

3. The next step to be applied to $\neg p(b, Y)$ has to be an instantiation step since
$\neg p(b, Y)\rho_T$ obviously is not ground. As a result, we assume $\rho_T = \{Y\backslash a\}$.

4. Now, it is possible to apply an extension step to $\neg p(b, Y)$. Again we use (a new
instance of) the transitivity clause which results in new open goals $\neg p(b, Y')$ and
$\neg p(Y', Y)$ (with $\sigma = \{X'\backslash b, Z'\backslash Y\}$).

5. As before we proceed with an instantiation step (applied to $\neg p(b, Y')$) which results
in the overall assumption $\rho_T = \{Y\backslash a, Y'\backslash b\}$.

6. Now we have built a tableau with open goals $\neg p(b, Y')$, $\neg p(Y', Y)$, and $\neg p(Y, e)$.
These goals can be proved via extension steps with clauses $\{p(b, c)\}$, $\{p(c, d)\}$,
and $\{p(d, e)\}$, respectively. Note that these extension steps are not prevented by
$\rho_T$ (which would be the case if generator literals were used to enumerate possible
instantiations, see Section 1).

On the other hand, we can use $\rho_T$ for regularity checks: For instance, suppose the
overall assumption generated by the last instantiation step is $\rho_T = \{Y\backslash a, Y'\backslash a\}$.
Then, tableau $T\rho_T$ violates regularity since it contains a branch with two occurrences
of $\neg p(b, a)$. In such a case, a correction step must be applied to correct the
assumption. In this way we cannot run into infinite derivations.⁴ Using regularity
without taking the assumptions into account, this cannot be guaranteed.

All this provides us with a calculus for deciding function-free clause sets which
does not suffer from unnecessary backtracking due to wrong instantiations for variables.
Note that (as indicated above) correction steps are only used for adapting $\rho_T$ if $T\rho_T$
violates regularity.
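The regularity violation discussed in the example can be replayed mechanically. The sketch below is illustrative only: the tuple encoding of literals, the names `is_regular` and `regular_corrections`, and the exhaustive enumeration over the constants of $S$ are our assumptions. It checks the branch $\neg p(b, e), \neg p(b, Y), \neg p(b, Y')$ under an assumption and lists the corrections that restore regularity.

```python
from itertools import product

CONSTANTS = ["a", "b", "c", "d", "e"]      # c(S) for the running example


def substitute(lit, rho):
    positive, pred, args = lit
    return (positive, pred, tuple(rho.get(a, a) for a in args))


def is_regular(branch, rho=None):
    """True iff no two nodes of the branch carry the same literal under rho."""
    rho = rho or {}
    seen = set()
    for lit in branch:
        instance = substitute(lit, rho)
        if instance in seen:
            return False
        seen.add(instance)
    return True


def regular_corrections(branch, rho):
    """Ground substitutions over dom(rho), different from rho, keeping the branch regular."""
    names = sorted(rho)
    for values in product(CONSTANTS, repeat=len(names)):
        candidate = dict(zip(names, values))
        if candidate != rho and is_regular(branch, candidate):
            yield candidate


branch = [(False, "p", ("b", "e")),
          (False, "p", ("b", "Y")),
          (False, "p", ("b", "Y'"))]
bad = {"Y": "a", "Y'": "a"}                # yields two occurrences of ¬p(b, a)
corrections = list(regular_corrections(branch, bad))
```

Without the assumption the branch looks regular, which is why the ordinary regularity criterion cannot prune it; under `bad` it is not, and any element of `corrections` is a legal result of a correction step for this branch.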

Definition 6 (Ground Model Elimination). A sequence $((T_1, \rho_{T_1}), \ldots, (T_n, \rho_{T_n}))$
is called a GME-derivation for a clause set $C$ if

1. for $1 \le i \le n$, $T_i$ is regular and

2. $(T_1, \rho_{T_1})$ is obtained by an initialization step and

3. for $1 < i \le n$, if $T_{i-1}\rho_{T_{i-1}}$ is regular, $(T_i, \rho_{T_i})$ is obtained by applying to
$(T_{i-1}, \rho_{T_{i-1}})$ either

- a reduction step,

- an extension step, or

- an instantiation step,

otherwise, if $T_{i-1}\rho_{T_{i-1}}$ violates regularity, $(T_i, \rho_{T_i})$ is obtained by applying to
$(T_{i-1}, \rho_{T_{i-1}})$

- a correction step such that $T_i\rho_{T_i}$ is regular.

A GME-derivation is called a GME-refutation if it generates a closed tableau.

⁴ A derivation generating a tableau $T$ such that $T\rho_T$ is regular cannot be infinite since all inner
literals of a branch (which are ground) must be different.

Theorem 1. A function free clause set C is unsatisfiable iff there exists a GME- 
refutation for C. Further, there exists no infinite GME-derivation for C. 
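The termination claim rests on a simple counting argument, which can be sketched as follows. The symbols $P$ (the set of predicate symbols occurring in $C$), $ar(p)$ (the arity of $p$) and $N$ are introduced here for illustration only and do not appear in the original text.

```latex
% Number of ground literals over the constants c(C):
N \;=\; 2 \sum_{p \in P} |c(C)|^{ar(p)}
% If T\rho_T is regular, the (ground) inner literals of a branch are pairwise
% distinct, so every branch has at most N inner nodes.
% Running example: one binary predicate p and c(S) = \{a, b, c, d, e\}, hence
N \;=\; 2 \cdot 5^{2} \;=\; 50 .
```

This bound only covers the length of the branches. It is meant as an intuition for why infinite derivations are excluded, not as a full proof of Theorem 1.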



3 Application to Default Model Elimination 

In this section, we apply our approach to a proof procedure for default reasoning. This is
an interesting application since default reasoning involves consistency checks that are
best accomplished with corresponding ground derivations.

Default Model Elimination (DME; [13]) is an ME-based calculus for query answering
in so-called semi-monotonic default logics, which allow for an easier computational
treatment. We show that DME can benefit from the approach proposed in this paper. As
mentioned above, this is due to the fact that each tableau generated by a DME-derivation
has to be a ground tableau. In [13] this is achieved by introducing additional literals for
enumerating possible variable bindings. However, as illustrated in Section 1, such a
procedure is a source of extensive backtracking.

In what follows, we adapt the approach presented in the previous sections to the
needs of DME. To this end, we first give a brief introduction to DME. Due to lack of
space, we must refer the reader to [2] for an introduction to default logic (along with
the concepts of default theory, default rule, extension, etc.). We develop our approach
for so-called normal default theories, a popular fragment of default logic on which all
major variants turn out to be equivalent. Such theories comprise default rules of the form
$\frac{\alpha : \gamma}{\gamma}$. In fact, any default theory can be translated into an equivalent format guaranteeing
that $\alpha$ and $\gamma$ are atomic propositions. We draw on this in the sequel for further easing
notation. Also, we keep denoting the prerequisites and the consequents of default rules
by $\alpha$ and $\gamma$.

Default Model Elimination (DME) differs from classical ME in the following
aspects:

Extension step: For each default $\delta$, a so-called $\delta$-clause $\{\gamma, \neg\alpha\}$ is generated.⁵ Such
a clause is restricted as input clause for extension steps which are applied to open
goals of the form $\neg\gamma$. This restriction reflects the fact that defaults are inference
rules rather than formulas. A clause that is no $\delta$-clause is called $\omega$-clause.
Extension steps using $\delta$-clauses as input clauses are called $\delta$-extension steps, other
extension steps are called $\omega$-extension steps.

Reduction step: Defaults cannot be used for reasoning by cases. This must be reflected
by the corresponding $\delta$-clauses. This is achieved by restricting the reduction step
as follows: Given an open branch $b = (L_1, \ldots, L_k, \ldots, L_n)$, the open goal $L_n$ can
only be solved by a reduction step via some branch-literal $L_k$ if no successor of $L_k$
on $b$ stems from a $\delta$-clause.

⁵ Recall that $\alpha$ and $\gamma$ are supposed to be atomic.

Consistency checking: Default reasoning requires that the defaults, say $\delta_1$ to $\delta_n$, used in
the course of a derivation are consistent, that is $W \cup \{\{\gamma_1\}, \ldots, \{\gamma_n\}\} \not\vdash \bot$. In a
DME-derivation, this condition is checked during each $\delta$-extension step.

To restrict computational efforts, DME relies on a model-based approach for this
purpose: In the course of a derivation, a model $m$ is used as a compact representation
for the consistency of $W \cup \{\{\gamma_1\}, \ldots, \{\gamma_n\}\}$. If the application of a
$\delta$-extension step brings in a new $\delta$-clause $\{\neg\alpha, \gamma\}$ (and some substitution $\sigma$), it is
checked via a satisfiability check whether $m \models \gamma\sigma$ holds. (Note that this check is
particularly simple if the tableau under consideration is ground. In such a case $\gamma\sigma$
is ground and the check boils down to a simple satisfiability check.⁶) If this is the
case, the derivation continues. Only for those cases where $m \not\models \gamma\sigma$, a new model for
$W \cup \{\{\gamma_1\}, \ldots, \{\gamma_n\}\} \cup \{\{\gamma\}\}$ must be found. If no such model can be provided,
$\delta$-clause $\{\neg\alpha, \gamma\}$ cannot be used in the current situation.
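The model-based consistency check can be sketched for the ground case as follows. This is a simplified illustration, not the procedure of [13]: we assume that $W$ is given as a list of ground clauses, that a model is the set of ground atoms it makes true, and that a new model is searched by brute force. The names `satisfies`, `find_model` and `check_default` are ours.

```python
from itertools import product


def satisfies(model, clause):
    """model: set of true ground atoms; clause: list of (positive, atom) pairs."""
    return any((atom in model) == positive for positive, atom in clause)


def find_model(clauses):
    """Brute-force search for a model of a set of ground clauses, or None."""
    atoms = sorted({atom for clause in clauses for _, atom in clause})
    for bits in product([False, True], repeat=len(atoms)):
        model = {a for a, bit in zip(atoms, bits) if bit}
        if all(satisfies(model, c) for c in clauses):
            return model
    return None


def check_default(model, W, used, gamma):
    """Return a model witnessing consistency of W, the used consequents and gamma, or None."""
    if gamma in model:                     # cheap membership test: model still fits
        return model
    clauses = list(W) + [[(True, g)] for g in used] + [[(True, gamma)]]
    return find_model(clauses)             # expensive case: look for a new model


W = [[(True, "p(b,c)")], [(False, "t(b,c)")]]
m = find_model(W)
m_after = check_default(m, W, [], "t(b,a)")
blocked = check_default(m, W, [], "t(b,c)")
```

Here `m_after` is a new model that additionally makes `t(b,a)` true, whereas `blocked` is `None` because `t(b,c)` contradicts $W$. In the latter case the corresponding $\delta$-clause could not be used.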

In [13] it is proved that DME is a sound and complete calculus for query answering in 
semi-monotonic default logics, in particular, for normal default theories. 

Further, it was shown that in order to preserve completeness, the concept of regular- 
ity has to be refined for DME. This refinement was called blockwise regularity in [13]; 
given a branch $b = (L_1, \ldots, L_n)$, it requires that

(i) the immediate ancestors of literals stemming from $\delta$-clauses must be different, i.e.
no default may be used to prove the prerequisite of a default with the same consequent,
and

(ii) for all $1 \le i < j \le n$, $L_i$ and $L_j$ must be different unless a $\delta$-extension step
has been applied to some $L_k$ with $i \le k < j$. That is, the “classical” parts of a
DME-derivation have to be regular in the classical sense.
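One possible reading of conditions (i) and (ii) is sketched below. We assume that a branch is given as a list of pairs `(literal, delta_extended)`, where the flag records whether a $\delta$-extension step has been applied to that node. This encoding and the function name `blockwise_regular` are our own and are not taken from [13].

```python
def blockwise_regular(branch):
    """branch: list of (literal, delta_extended) pairs, ordered from top to leaf."""
    # Condition (i): goals to which a delta-extension step was applied differ pairwise.
    delta_goals = [lit for lit, delta in branch if delta]
    if len(delta_goals) != len(set(delta_goals)):
        return False
    # Condition (ii): within each block between delta-extension steps,
    # all literals must be different.
    block = set()
    for lit, delta in branch:
        if lit in block:
            return False
        block.add(lit)
        if delta:
            block = set()                  # a delta-extension step starts a new block
    return True


across_blocks = [("¬t(b,a)", True), ("¬p(b,a)", False), ("¬t(b,a)", False)]
classical_loop = [("¬p(b,e)", False), ("¬p(b,a)", False), ("¬p(b,e)", False)]
same_default = [("¬t(b,a)", True), ("¬p(b,a)", False), ("¬t(b,a)", True)]
```

Under this reading, `across_blocks` is admissible although it repeats a literal, because a $\delta$-extension step separates the two occurrences. The other two branches are rejected.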

When applied to clause sets and defaults containing variables, one has to cope with 
two problems: 

1. Using blockwise regularity as the only means for restricting derivations is no longer
sufficient for avoiding infinite derivations. 

2. Since the justification of a default might contain variables, the aforementioned ap- 
proach to consistency checking becomes more complex since, for instance, a sat- 
isfiability check via a simple membership test is no longer possible (provided one 
uses Herbrand models and ground atoms, as done in [10]). 

Both problems can be successfully handled by incorporating the techniques introduced
in Section 2 into DME: Instantiation steps, on the one hand, guarantee that $T\rho_T$
contains no inner literal with variables.⁷ This is particularly useful because every
infinite tableau meeting this criterion violates blockwise regularity. On the other hand, if
some $\delta$-clause $\{\neg\alpha, \gamma\}$ is applied (for a $\delta$-extension step), it is guaranteed that $\gamma\sigma\rho_T$
is ground which allows a simple satisfiability check (see above). As in the previous
sections, correction steps come into play when it is necessary to correct $\rho_T$.

⁶ This is even an easy membership check if Herbrand models and atomic propositions are dealt
with.

⁷ Recall Footnote 2.

The necessity to guarantee consistency in the course of a DME-derivation, however,
requires adapting the inference steps introduced in Section 2. In the course of a derivation
we always have to ensure that given a tableau $T$ there is a model for⁸

$W \cup \{\{\gamma_1\}, \ldots, \{\gamma_n\}\}\rho_T$ (3)

where $\gamma_1, \ldots, \gamma_n$ denote the tableau literals in $T$ corresponding to the consequents of
the employed $\delta$-clauses. In particular, this invariant has to be checked for each derivation
step which requires the modification of $\rho_T$ or makes parts of the assumptions in
$\rho_T$ superfluous (which might happen due to extension or reduction steps).

For illustration, we extend our running example with clause $\{\neg t(b, c)\}$ and default
$\delta = \frac{p(X, Y) : t(X, Y)}{t(X, Y)}$. Further, we use $\neg t(b, Y)$ as (negated) query. At the beginning of the
derivation, we apply an instantiation step which generates the assumption $\rho = \{Y\backslash a\}$.
Afterwards, an extension step is applied which uses the $\delta$-clause corresponding to $\delta$,
namely $\{t(X, Y), \neg p(X, Y)\}$. Until now the aforementioned invariant holds since there
exists a (Herbrand) model for $W \cup \{\{t(b, Y)\}\}\rho$. This invariant, however, is violated
after an application of an extension step with clause $\{p(b, c)\}$ to the remaining open
goal since there is (obviously) no model for $W \cup \{\{t(b, c)\}\}$. Note that this extension
step makes the previously generated assumption superfluous.

To distinguish those instances of derivation steps which require checking the
invariant, each derivation step is augmented by a boolean expression $CC$, the so-called
correction condition. If, after the application of some derivation step, the correction
condition of this particular step evaluates to true, a correction step (see Definition 12)
has to be applied in order to correct, if necessary,⁹ the assumptions in a way which
satisfies the invariant. If the assumptions cannot be corrected, backtracking has to be
performed. Further, the following definitions (for derivation steps) differ from the ones
given in Section 2 with respect to the following aspects:

(i) Each tableau $T$ is associated with a model $m_T$ which bears witness of the
consistency of (3).

(ii) As explained above, we have to distinguish $\omega$-extension and $\delta$-extension steps. The
definition of the former one is similar to Definition 2 in Section 2. The latter one
(which reflects the application of defaults) includes the aforementioned consistency
check (via testing model $m_T$).

(iii) Besides adapting the assumptions made in the course of a derivation, a correction
step also involves the search for models of (3), if necessary.

Definition 7. $(T', \rho_{T'}, m_{T'})$ is obtained by an initialization step as follows. Let $o$ be
the root of a one-node tree. Select negative $\omega$-clause $\{L_1, \ldots, L_n\} \in C$. Then, attach
$n$ new successor nodes to $o$, and label them in turn with $L_1, \ldots, L_n$. Further set $\rho_{T'} = \emptyset$,
let $m_{T'}$ be a (Herbrand) model of $W$ and set $CC = false$.

We call $\omega$-clause $\{L_1, \ldots, L_n\}$ the top-clause.

⁸ Note that for the sake of completeness, it is in fact necessary to take $\rho_T$ into account. This is
because there might exist a model for (3) although $W \cup \{\{\gamma_1\}, \ldots, \{\gamma_n\}\}$ is unsatisfiable.

⁹ $CC = true$ does not necessarily imply that the invariant is violated; on the other hand, if the
invariant does not hold, $CC$ always evaluates to true.

Definition 8. $(T', \rho_{T'}, m_{T'})$ is obtained from $(T, \rho_T, m_T)$ by an $\omega$-extension step as
follows. Select in $t$ a leaf node $o$ of an open branch labeled with literal $L$ where
$var(L\rho_T) = \emptyset$.

Let $\{L_1, \ldots, L_n\}$ be a new instance of some $\omega$-clause in $C$ such that $L^c\sigma = L_i\sigma$ for
some $i \in \{1, \ldots, n\}$ and some mgu $\sigma$. Then, attach $n$ new successor nodes $o_1, \ldots, o_n$
to $o$, label them in turn with $L_1, \ldots, L_n$, respectively, and apply $\sigma$ to the resulting
tableau. The new branch with leaf node $o_i$ is marked as closed. Finally set $\rho_{T'} = \rho_T$,
$m_{T'} = m_T$, and $CC = (dom(\sigma) \cap dom(\rho_T) \neq \emptyset)$.

Note that if the correction condition of an $\omega$-extension step evaluates to true, $\sigma$ has
“corrected” at least one assumption of $\rho_T$. Hence, it is no longer guaranteed that $m_T$ is
a model for (3).

Definition 9. $(T', \rho_{T'}, m_{T'})$ is obtained from $(T, \rho_T, m_T)$ by a $\delta$-extension step as
follows. Select in $t$ a leaf node $o$ of an open branch labeled with literal $L$ where
$var(L\rho_T) = \emptyset$.

Let $\{\neg\alpha_\delta, \gamma_\delta\}$ be a new instance of some $\delta$-clause in $C$ and $\sigma$ be a mgu such that
$L^c\sigma = \gamma_\delta\sigma$. Then, attach the two new successor nodes $o_1$ and $o_2$ to $o$, label them
in turn with $\neg\alpha_\delta$ and $\gamma_\delta$, respectively, and apply $\sigma$ to the resulting tableau. The new
branch with leaf node $o_2$ is marked as closed. Finally set $\rho_{T'} = \rho_T$, and

$CC$ = (i) or (ii) where (i) = $dom(\sigma) \cap dom(\rho_T) \neq \emptyset$ and (ii) = $m_T \not\models \gamma_\delta\sigma\rho_T$.

Condition (ii) checks whether the actual model $m_T$ is consistent with the applied
default $\delta$ (which is represented by the $\delta$-clause). If the simple satisfiability check¹⁰ fails,
a correction step has to be applied in order to adapt the assumptions $\rho_T$ or to find a new
model for (3).
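The two correction conditions can be sketched as small predicates. The encoding is an assumption made for illustration: substitutions are dicts, atoms are `(predicate, arguments)` tuples, and a Herbrand model is the set of ground atoms it makes true. The names `cc_omega` and `cc_delta` are hypothetical.

```python
def substitute(atom, subst):
    """Apply a substitution (dict) to an atom given as (predicate, arguments)."""
    pred, args = atom
    return (pred, tuple(subst.get(a, a) for a in args))


def cc_omega(sigma, rho):
    """CC of an omega-extension or reduction step: sigma touches an assumed variable."""
    return bool(set(sigma) & set(rho))


def cc_delta(sigma, rho, model, gamma):
    """CC of a delta-extension step: condition (i) or condition (ii)."""
    ground_consequent = substitute(substitute(gamma, sigma), rho)
    return cc_omega(sigma, rho) or ground_consequent not in model


rho = {"Y": "a"}                               # assumption made by an instantiation step
sigma = {"X1": "b", "Y1": "Y"}                 # mgu of t(b, Y) and t(X1, Y1)
gamma = ("t", ("X1", "Y1"))                    # consequent of a fresh delta-clause instance
model_with = {("p", ("b", "c")), ("t", ("b", "a"))}
model_without = {("p", ("b", "c"))}
```

With `model_with` no correction is needed, with `model_without` condition (ii) fires, and a unifier such as `{"Y": "c"}` that binds an assumed variable triggers condition (i).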

Definition 10. $(T', \rho_{T'}, m_{T'})$ is obtained from $(T, \rho_T, m_T)$ by a reduction step as
follows. Select in $t$ the leaf node $o_k$ of an open branch $b = (o_1, \ldots, o_k)$ where $o_k$ is
labeled with literal $L$ and $var(L\rho_T) = \emptyset$. If there is an ancestor node $o_i$ on $b$ labeled
with literal $K$ such that $L^c\sigma = K\sigma$ for some mgu $\sigma$, and all nodes $o_{i+1}, \ldots, o_k$ are
$\omega$-extension-resulting nodes, then apply $\sigma$ to $T$, mark $b$ as closed and set $\rho_{T'} = \rho_T$,
$m_{T'} = m_T$, and $CC = (dom(\sigma) \cap dom(\rho_T) \neq \emptyset)$.

Definition 11. $(T', \rho_{T'}, m_{T'})$ is obtained from $(T, \rho_T, m_T)$ by an instantiation step
as follows. Select in $t$ the leaf node $o_k$ of an open branch $b = (o_1, \ldots, o_k)$ where $o_k$
is labeled with literal $L$ and $var(L\rho_T) \neq \emptyset$. Let $\rho$ be a ground substitution such that
$var(L\rho_T\rho) = \emptyset$. Then, set $\rho_{T'} = \rho_T\rho$, $m_{T'} = m_T$, $T' = T$, and $CC = false$.

Definition 12. $(T', \rho_{T'}, m_{T'})$ is obtained from $(T, \rho_T, m_T)$ by a correction step as
follows. Select a ground substitution $\rho$ such that $var(L\rho) = \emptyset$ for each inner literal $L$ of
$T$, and there exists a (Herbrand) model $m$ for $W \cup \{\{\gamma_1\}, \ldots, \{\gamma_n\}\}\rho$.

Then, set $T' = T$, $\rho_{T'} = \rho$, $m_{T'} = m$, and $CC = false$.

For illustration, consider our extended running example consisting of $\delta$-clause
$c_\delta = \{t(X, Y), \neg p(X, Y)\}$ stemming from default $\delta = \frac{p(X, Y) : t(X, Y)}{t(X, Y)}$ and $\omega$-clauses

$S = \{\{p(X, Y), \neg p(X, Z), \neg p(Z, Y)\}, \{p(a, b)\}, \{p(b, c)\}, \{p(c, d)\}, \{p(d, e)\},
\{\neg t(b, c)\}, \{\neg t(b, Y)\}\}$ where the last clause represents the (negated) query $t(b, Y)$.

¹⁰ Note that this simple check is only possible since $\gamma_\delta\sigma\rho_T$ is ground.

1. A derivation starts with an initialization step generating a tableau $T$ with open goal
$\neg t(b, Y)$ and a (Herbrand) model $m_T$ for $S \cup \{c_\delta\}$. Obviously, $m_T$ contains all the
ground facts in $S$.

2. Since the open goal contains a variable (which is not affected by an assumption),
an instantiation step is applied afterwards, which generates the assumption
$\rho_T = \{Y\backslash a\}$.

3. Next, it is possible to apply a $\delta$-extension step with $\delta$-clause $c_\delta$. Since the applied
substitution does not affect $\rho_T$ and $\neg t(b, Y)\rho_T \notin m_T$, no correction step needs to
be applied.

4. One might object that the resulting open goal $\neg p(b, Y)$ could be solved via an
extension step with fact $p(b, c)$. If this extension step is applied, a correction step
has to be applied afterwards since the corresponding substitution makes the
previously generated assumptions ($\rho_T$) useless and therefore affects the consistency
check performed for Step 3. But such a correction step must fail, since there
(obviously) exists no model for $S \cup \{c_\delta\} \cup \{\{t(b, c)\}\}$. Therefore, the sole possibility
to expand the tableau is to apply an $\omega$-extension step with the transitivity clause
resulting in two new open subgoals $\neg p(b, Z)$ and $\neg p(Z, Y)$. Afterwards it is quite
straightforward to build a closed tableau via instantiation and $\omega$-extension steps.
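The failing correction step of Step 4 can be replayed with a strongly simplified sketch. As an assumption of this sketch (not of the calculus), only the ground unit clauses of the example are considered, so that consistency reduces to checking that no atom is required to be both true and false. The requirement on inner literals and blockwise regularity are ignored. All names are hypothetical.

```python
from itertools import product

CONSTANTS = ["a", "b", "c", "d", "e"]

# Ground unit clauses of the example only, written as (positive, atom).
W_UNITS = {
    (True, ("p", ("a", "b"))), (True, ("p", ("b", "c"))),
    (True, ("p", ("c", "d"))), (True, ("p", ("d", "e"))),
    (False, ("t", ("b", "c"))),
}


def substitute(atom, rho):
    pred, args = atom
    return (pred, tuple(rho.get(a, a) for a in args))


def model_for(consequents, rho):
    """Set of true atoms satisfying W_UNITS and the instantiated consequents, or None."""
    literals = set(W_UNITS) | {(True, substitute(g, rho)) for g in consequents}
    if any((not positive, atom) in literals for positive, atom in literals):
        return None                        # some atom would have to be true and false
    return {atom for positive, atom in literals if positive}


def correction_step(consequents, variables):
    """Search a ground substitution for which the invariant has a model."""
    for values in product(CONSTANTS, repeat=len(variables)):
        rho = dict(zip(variables, values))
        model = model_for(consequents, rho)
        if model is not None:
            return rho, model
    return None


before = correction_step([("t", ("b", "Y"))], ["Y"])   # consequent still open in Y
after = correction_step([("t", ("b", "c"))], [])       # Y already bound to c
```

As long as `Y` is only assumed, a suitable assumption exists (`before`). Once the extension step with `p(b, c)` has bound `Y` to `c` in the tableau, no assumption can repair the invariant and `after` is `None`.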

In all, we obtain the following calculus for reasoning with $\omega$- and $\delta$-clauses that
represent an underlying normal default theory:

Definition 13 (Ground Default Model Elimination). A sequence
$((T_1, \rho_{T_1}, m_{T_1}), \ldots, (T_n, \rho_{T_n}, m_{T_n}))$ is called a GDME-derivation for a set of
$\omega$- and $\delta$-clauses $C$ if

1. for $1 \le i \le n$, $T_i$ is blockwise regular and

2. $(T_1, \rho_{T_1}, m_{T_1})$ is obtained by an initialization step and

3. for $1 < i \le n$, if $T_{i-1}\rho_{T_{i-1}}$ is blockwise regular and the correction condition of
the derivation step applied to $T_{i-1}$ evaluates to false, $(T_i, \rho_{T_i}, m_{T_i})$ is obtained by
applying to $(T_{i-1}, \rho_{T_{i-1}}, m_{T_{i-1}})$ either

- a reduction step,

- a $\delta$- or $\omega$-extension step, or

- an instantiation step,

otherwise, if $T_{i-1}\rho_{T_{i-1}}$ violates blockwise regularity or the correction condition of
the derivation step applied to $T_{i-1}$ evaluates to true, $(T_i, \rho_{T_i}, m_{T_i})$ is obtained by
applying to $(T_{i-1}, \rho_{T_{i-1}}, m_{T_{i-1}})$

- a correction step such that $T_i\rho_{T_i}$ is blockwise regular.

A GDME-derivation is called a GDME-refutation if it generates a closed tableau
and the correction condition of the last inference step evaluates to false.

The next theorem gives the exact correspondence to query-answering in (normal)
default logic:

Theorem 2. Let $(D, W)$ be a normal function-free default theory and $\varphi$ be an atomic
function-free formula. Let $C$ be the clausal representation of the atomic format of
$(D, W)$.

We have that $\varphi$ is in some extension of $(D, W)$ iff there is a GDME-refutation for
$C$ with top-clause $\{\varphi^c\}$. Further, there exists no infinite GDME-derivation for $C$ with
top-clause $\{\varphi^c\}$.



4 Conclusion

Our contribution is twofold: First, we have provided an extension of model elimination
that is particularly suited for dealing with Datalog-like languages, needed for
knowledge representation. This allows for avoiding backtracking steps over variable
instantiations and moreover for enforcing the practically indispensable regularity condition.
The most important advantage of this approach is however that it allows for
special-purpose adaptations, needed for instance for default reasoning. As our second yet major
contribution, we have taken advantage of this for giving a calculus for query-answering
in (normal) default logic. The particular benefit was obtained for consistency checking
that is arguably best done with ground formulas (and Herbrand models).

The number of correction steps to be applied in the course of a derivation clearly
depends on the ’quality’ of the assumptions made during instantiation and correction
steps. One step towards an improved quality of assumptions is to restrict the number
of possible variable bindings in preprocessing steps. Corresponding techniques can be
found e.g. in [9] and [3]. Both approaches are based on a propagation of possible
variable bindings through a literal graph.



Abstract. Intelligent agents embedded in physical environments need
the ability to connect, or anchor, the symbols used to perform abstract
reasoning to the physical entities which these symbols refer to. Anchoring
must rely on perceptual data which is inherently affected by uncertainty.
We propose an anchoring technique based on the use of fuzzy sets to
represent uncertainty, and of degree of subset-hood to compute the partial
match between signatures of objects. We show examples where we use
this technique to allow a deliberative system to reason about the objects
(cars) observed by a vision system on board an unmanned helicopter,
in the framework of the WiTAS project.



1 Introduction 

The focus of this paper is on autonomous systems embedded in a real-world,
physical environment. A typical example is an autonomous mobile robot that
has to provide services inside a factory, or to explore a distant planet. Being
embedded in the physical world, these systems need to incorporate processes at the
sensori-motor level that provide the needed perceptual and execution capabili- 
ties. However, these systems also need the ability to perform high-level, abstract 
reasoning if they are to operate reliably in a dynamic and uncertain world with- 
out the need for human assistance. For example, a mail delivery robot faced with 
a closed door should decide whether to plan an alternative way to achieve its 
goal, or to reschedule its activities and try again this delivery later on. 

The need to integrate low-level and high-level representations and processes 
is one of the major challenges of autonomous embedded systems, and most of the 
current architectures for autonomous robots address this challenge in some way 
or another [5]. In our work, we focus on one particular aspect of this integration 
problem: the connection between the abstract representations used by the high- 
level reasoning processes to denote a specific physical object, and the data in 
the low-level processes that correspond to that object. Following [7], we call 
anchoring the process of establishing this connection.

In general, we assume that the high-level process associates each object in
its universe of discourse to a unique name, and to a set of properties that (non- 
univocally) describe that object. For example, an object named ‘car-3’ with 
the description ‘small red Mercedes on Road-61.’ Anchoring this object then 
includes two steps: (i) use the perceptual apparatus to find an object whose 
observed features match the properties in the description; and (ii) update those 
properties using the observed values. In our example, anchoring ‘car-3’ means 
to: (i) perceptually find a small red Mercedes on Road-61; and (ii) update the 
description of ‘car-3’ using the observed size and color of the found car, and 
possibly other properties like position and speed. 
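The two anchoring steps can be sketched as follows. This is a deliberately crisp illustration with hypothetical names (`matches`, `anchor`) and an invented percept format: exact equality of properties stands in for the degree of matching based on fuzzy sets that the paper proposes.

```python
def matches(description, percept):
    """Crisp stand-in for matching: every described property equals the observed value."""
    return all(percept.get(key) == value for key, value in description.items())


def anchor(name, descriptions, percepts):
    """Step (i): find a matching percept; step (ii): update the description with it."""
    description = descriptions[name]
    for percept in percepts:
        if matches(description, percept):
            descriptions[name] = {**description, **percept}
            return percept
    return None


descriptions = {
    "car-3": {"size": "small", "color": "red", "make": "Mercedes", "road": "Road-61"},
}
percepts = [
    {"size": "large", "color": "red", "make": "Mercedes", "road": "Road-61"},
    {"size": "small", "color": "red", "make": "Mercedes", "road": "Road-61", "speed": 52.0},
]
found = anchor("car-3", descriptions, percepts)
```

After the call, the description of `car-3` also carries the observed speed. A graded match would instead return a ranked list of candidate anchors.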

One of the difficulties in the anchoring problem is that the data provided 
by the sensory system is inherently affected by a large amount of uncertainty. 
This may result in errors and ambiguities when trying to match these data to 
the high-level description of an object. In order to improve the reliability of the 
anchoring process, this uncertainty has to be taken into account in the proper 
way. Research in fuzzy logic has produced a number of techniques for dealing 
with different facets of uncertainty [4, 1]. In this work, we propose to use these
techniques to define a degree of matching between a perceptual signature and an
object description. The possibility to distinguish between objects that match a 
given description at different degrees is pivotal to the ability to discriminate per- 
ceptually similar objects under poor observation conditions. Moreover, degrees 
of matching allow us to consider several possible anchors, ranked by their degree 
of matching. Finally, these degrees can be used to reason about the quality of an 
anchor, and to perform higher-level decision making; for example, we can decide 
to engage in some active perception in order to get a better view on a candidate 
anchor. 

In the rest of this paper, we deal with the anchoring problem in the con- 
text of an architecture for unmanned airborne vehicles (UAVs) used for traffic 
surveillance. This architecture, outlined in the next section, integrates several 
subsystems, including a vision system and an autonomous decision making system.
The anchoring problem in this context means to make the decision making
system and the vision system agree about the identity of the objects which they
are talking about, as in our ‘car-3’ example. This is discussed in Section 3.
In Section 4, we show how we use fuzzy sets to represent the uncertain data
in our domain, and to compute degrees of matching. Section 5 illustrates the
use of these degrees by going through a couple of examples, which are run in
simulation. Finally, Section 6 discusses the results and traces future directions.

2 The WiTAS Project 

The WiTAS project, initiated in January 1997, is devoted to research on infor- 
mation technology for autonomous systems, and more precisely to unmanned 
airborne vehicles (UAVs) used for traffic surveillance. 

The general architecture of the system is a standard three-layered agent 
architecture consisting of a deliberative, a reactive, and a process layer. The 
deliberative layer generates at run-time probabilistic high-level predictions of the 
behaviors of agents in their environment, and uses these predictions to generate 
conditional plans. The reactive layer performs situation-driven task execution, 
including tasks relating to the plans generated by the deliberative layer. The 
reactive layer has access to a library of task and behavior descriptions, which 
can be executed by the reactive executor. The process layer contains image 
processing and flight control, and can be reconfigured from the reactive layer by 
means of switching on and off groups of processes. Besides vision, the sensors 
and knowledge sources of the system include: a global positioning system (GPS) 
that gives the position of the vehicle, a geographical information system (GIS) 
covering the relevant area of operation, and standard sensors for speed, heading 
and altitude. 

The system is fully implemented in its current version. Because of the nature 
of the work, most of the testing is done using simulated UAVs in simulated 
environments, although real image data have been used to test the vision 
module. In a second phase of the project, however, the testing will be done 
using real UAVs. More information about the project can be found at [9]. 

Of particular interest for this presentation is the interaction between the reac- 
tive layer and the image processing in the process layer. This is done by means of 
a specialized component for task specific sensor control and interpretation called 
the Scene Information Manager (SIM). 




Fig. 1. Overview of the Scene Information Manager and its interaction with the Vision 
module and the Reactive Executor. 



The SIM (Fig. 1) is part of the reactive layer and manages sensor resources: 
it reconfigures the vision module, via skill configuration calls, on the 
basis of the information requests coming from the reactive executor; it 
anchors symbolic identifiers to image elements (points, regions); and it handles 
simple vision failures, in particular temporary occlusion and errors in car 
re-identification. 

In this paper we focus on the anchoring functionality of the SIM, and in par- 
ticular on the matching of symbolic identifiers to image elements in the presence 
of uncertainty in the data provided by image processing. A description of the 
WiTAS architecture and of the role of the SIM in it can be found in [3]. 

3 Anchoring in the SIM 

Two of the main aspects of anchoring implemented in the SIM are identification 
of objects on the basis of a visual signature expressed in terms of concepts, and 
re-identification of objects that have been previously seen, but that have then 
been out of the image or occluded for a short period. 

For identification and re-identification the SIM uses the visual signature of the 
object, typically color and geometrical description, and the expected positions of 
the object. For instance, if the SIM has the task of looking for a small red Mercedes 
near a specified crossing, it provides the vision module with: the coordinates 
of the crossing; the HSV (hue, saturation and value) representation of “red”; 
and the length, width and area of a small Mercedes. The measurements made in 
the vision module have a degree of inaccuracy, so the SIM also provides the vision 
module with the intervals inside which the measurement of each of the 
features is acceptable. The size of each interval depends on how discriminating one 
wants to be in the selection of the objects and also, in the case of re-identification 
of an object, on how accurate previous measurements of the object were. 

The vision module receives the position where to look for an object and 
the visual signature of the object and it is then responsible for performing the 
processing required to find the objects in the image whose measures are in the 
acceptability range and report the information about the objects to the SIM. The 
vision module moves the camera toward the requested position and it calculates 
for each object in the image and for each requested feature of the object an 
interval containing the real value. If the generated interval intersects with the 
interval of acceptability provided in the visual signature for the feature, the 
feature is considered to be in the acceptability range. The vision module reports 
to the SIM information about color, shape, position, and velocity of each object 
whose features are all in the acceptability range. 
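
The interval test described above can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary layout of a signature, and all numbers are assumptions made for the example, not part of the SIM or of the vision module.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (low, high) with low <= high


def intersects(a: Interval, b: Interval) -> bool:
    """Return True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def in_acceptability_range(measured: Dict[str, Interval],
                           signature: Dict[str, Interval]) -> bool:
    """An object passes only if every requested feature intersects its range."""
    return all(intersects(measured[name], signature[name]) for name in signature)


def report_candidates(objects: Dict[int, Dict[str, Interval]],
                      signature: Dict[str, Interval]) -> List[int]:
    """IDs of the objects whose features are all in the acceptability range."""
    return [oid for oid, measured in objects.items()
            if in_acceptability_range(measured, signature)]


# Hypothetical numbers, for illustration only.
signature = {"length": (4.0, 5.0), "area": (8.0, 11.0)}
objects = {
    66: {"length": (4.8, 5.4), "area": (9.2, 10.8)},  # both features overlap
    69: {"length": (5.5, 6.0), "area": (9.0, 10.0)},  # length does not overlap
}
candidates = report_candidates(objects, signature)
```

With these made-up values only object 66 is reported, because the measured length interval of object 69 lies entirely outside the acceptable one.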

Intersection of intervals is a simple, but not very discriminating method to 
identify an object. As a consequence, several objects that are somehow similar 
to the intended one can be sent back by the vision module to the SIM. The SIM 
then needs to apply some criteria in order to perform a further selection of the 
best matching object between those reported by the vision module. The selection 
of the best matching object should depend on how well the objects match the 
different aspects of the signature, but also on the accuracy of the measurements 
performed by the vision and their reliability. In what follows, we show how we 
perform this selection using fuzzy signature matching. 

4 Fuzzy Signature Matching 

Let us look more closely at the process of anchoring a high-level description 
coming from the symbolic system (reactive executor) to the data coming from 
the vision module. As an example, consider the case in which a task needs to refer 
to ‘a small red Mercedes.’ The SIM has to link two types of data: on the 
one side, the description containing the symbols ‘red,’ ‘small’ and ‘Mercedes’ 
received from the symbolic system; and on the other side, the values of the 
measurable features (HSV, length, etc.) of observed cars which are sent by the 
vision system. Anchoring requires converting these representations to a common 
frame, and finding the car that best matches the description. In our case, we 
have chosen to convert symbols to the values used by the vision system. 



4.1 Uncertainty Representation 



In general, both the symbolic descriptions and the data coming from the vision 
system are affected by several types of inexactness. Symbolic descriptions use 
linguistic terms like ‘red’ and ‘small’ that do not denote a unique numerical 
value. Fuzzy sets are commonly considered to be an adequate representation of 
linguistic terms [10,4], so in our system we have chosen to map each symbol of 
this kind to a fuzzy set over the relevant space. For example, we associate the 
term ‘red’ to three fuzzy sets: one for the hue characterizing the tint of color, 
one for the saturation characterizing the purity of the color, and one for value 
characterizing its intensity. Fig. 2 (left) shows the fuzzy set for the hue. This fuzzy 
set is interpreted as follows: for each possible value of hue h, the value of red(h) 
measures, on a [0, 1] scale, how much h can be regarded as ‘red’. 

A second example shows how we represent the linguistic term ‘small-Mercedes’ 
by a set of fuzzy sets over the space of the possible values of length, 
width and area of the car. Fig. 2 (right) shows the fuzzy set for the area. The 
reason why we consider the term ‘small-Mercedes’ and not just ‘small’ is because 
what should be regarded as ‘small’ depends on the type of car we are talking 
about. In practice, we use a database that associates each car type to its typical 
length, width, and area, represented by fuzzy sets. Cars of unknown types are 
associated with generic fuzzy sets, like the ‘small’ (car) shown by the dotted 
lines in the picture. In our implementation, we only consider trapezoidal fuzzy 
sets for computational reasons. 
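
A trapezoidal fuzzy set is fully described by four parameters. The sketch below shows one possible encoding; the class name, the parameter names (a, b, c, d) and the numbers used for the ‘small’ area set are assumptions made for illustration and are not taken from the system's database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal fuzzy set with support [a, d] and core [b, c]."""
    a: float
    b: float
    c: float
    d: float

    def __call__(self, x: float) -> float:
        """Membership degree of x, on a [0, 1] scale."""
        if x < self.a or x > self.d:
            return 0.0
        if self.b <= x <= self.c:
            return 1.0
        if x < self.b:                       # rising edge
            return (x - self.a) / (self.b - self.a)
        return (self.d - x) / (self.d - self.c)  # falling edge


# Hypothetical fuzzy set for the area of a 'small' car (square metres).
small_area = Trapezoid(7.0, 8.0, 10.0, 11.0)
```

Here `small_area(9.0)` is 1.0 (inside the core), `small_area(7.5)` is 0.5 (halfway up the rising edge), and any value outside [7.0, 11.0] has degree 0.0.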




Fig. 2. Fuzzy sets for representing the symbols ‘red’ and ‘small-Mercedes.’ 

Data from the vision system are affected by uncertainty and imprecision in 
several ways. Consider the measurement of the area of an observed car. Roughly, 
this measurement is done by identifying the edges of the car in the image, count- 
ing the pixels occupied by the car, and converting this count to a metric measure 
by some geometric computations. There are a number of factors that influence 
the correctness of the measured value. First, the discretization of the image limits 
the precision of the measure. Second, the measurement model may be inaccurate: 
for instance, cars are assumed to be rectangular, but this is not completely true 
in reality. The impact of both effects depends on the size of the car in the image, 
which in turn depends on its distance from the camera and on the focal length of 
the camera. Third, the measurement is affected by the perspective distortion due 
to the angle between the car plane and the optical axis: if the car is not perpen- 
dicular to the optical axis, its projection on the image will be shorter. Fourth, 
the geometric parameters needed to compute the length may not be known with 
precision: for example, the angle between the car and the optical axis depends 
on the inclination of the road and of that of the car, both of which are hard to 
evaluate. Finally, the measured value can be totally invalid if there has been an 
error in the identification of the edges of the car in the image; for instance, if 
the car has been merged with its shadow, or with another car in front of it. 

The above discussion reveals that there is a great amount of uncertainty that 
affects the measured value for the length of an object; and that this uncertainty 
is very difficult to precisely quantify — in other words, we do not have a model of 
the uncertainty that affects our measures. Similar observations can be made for 
other features measured by the vision system: for example, the measurement of 
the color of an object is influenced by the spectral characteristics of the light that 
illuminates that object. Given the difficult nature of the uncertainty in the 
data coming from the vision system, we have chosen to use (trapezoidal) 
fuzzy sets to represent these data. A fuzzy set 
representation allows us to incorporate heuristic knowledge about the inexactness 
that affects our measures by choosing a specific shape for the corresponding fuzzy 
set. For example, Fig. 3 shows the fuzzy sets that represent the observed hue and 
area of a car object, respectively. The measure of the hue is rather precise in 
this case. The construction of the fuzzy set for the area value goes as follows.^ 
The vision system has computed the interval [9.2, 10.8] as the possible values 
for the area (this is an interval because of the image discretization). Knowing 
that the viewing angle between the camera and the car is about 30°, we slightly 
increase the upper bound of the interval, meaning that the car might be bigger 
than it appears. We then take this interval to be the core of our fuzzy set, and 
we “blur” the edges to account for the other possible sources of errors. Although 
the definition of the fuzzy sets used to represent measured features is mostly 
heuristic, it has resulted in good performance in our experiments. 
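
The construction just described can be sketched as follows. The text gives only the qualitative steps (widen the upper bound for oblique views, take the interval as the core, blur the edges), so the widening rule and the `widen` and `blur` parameters below are purely hypothetical choices made to obtain a runnable example.

```python
import math
from typing import Tuple


def observed_area_set(low: float, high: float, view_angle_deg: float,
                      widen: float = 0.1, blur: float = 0.5
                      ) -> Tuple[float, float, float, float]:
    """Return (a, b, c, d): support [a, d] and core [b, c] of a trapezoid."""
    # Assumed heuristic: stretch the upper bound more for more oblique views,
    # meaning that the car might be bigger than it appears.
    upper = high * (1.0 + widen * math.sin(math.radians(view_angle_deg)))
    # The (widened) interval is the core; the edges are blurred by `blur`.
    return (low - blur, low, upper, upper + blur)


# Interval from the text, viewed at about 30 degrees.
a, b, c, d = observed_area_set(9.2, 10.8, 30.0)
```

With these assumed parameters the core starts at 9.2 and ends slightly above 10.8, and the support extends half a unit beyond the core on each side.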



^ These fuzzy sets are currently constructed inside the SIM from the values provided 
by the vision system. We are now in the process of moving this construction into the 
vision system itself, where more informed heuristics can be applied. 

Fig. 3. Fuzzy sets for hue (left) and area (right) obtained from the vision system. 



4.2 Fuzzy Matching of one Feature 

Once we have represented both the desired description and the observed data by 
fuzzy sets, we can compute their degree of matching using fuzzy set operations. 
This choice is justified in our case since fuzzy sets can be given a semantic 
characterization in terms of degrees of similarity [6]. There is however a subtle 
difference between the notion of similarity and our intended notion of matching. 
Consider two fuzzy sets A and B over a common domain X which respectively 
represent the observed data and the target description. The degree of matching 
of A to B, denoted by match(A, B), is the degree by which the observed value A 
can be one of those that satisfy our criterion B. Thus, matching implies some 
sort of overlap between A and B, but it does not require that A and B have a 
similar shape. Moreover, matching is not required to be commutative. 

In our work, we have tried two different definitions for a degree of matching 
(see, e.g., [1] for these and alternative definitions). In the first one, we measure 
how much A and B intersect by measuring the height of A ∩ B. This gives us 
the following degree: 

\[ \mathrm{match}_1(A, B) = \sup_{x \in X} \min\{A(x), B(x)\} \tag{1} \]



In the second definition, we measure how much A is a (fuzzy) subset of B by 
comparing the area of A ∩ B and the area of B: 

\[ \mathrm{match}_2(A, B) = \frac{\int_{x \in X} \min\{A(x), B(x)\}\, dx}{\int_{x \in X} B(x)\, dx} \tag{2} \]



Different definitions can be obtained using T-norm operators other than min. 
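
Both degrees can be approximated numerically on a sampled domain. The sketch below is an illustration, not the system's implementation: the `trapezoid` helper, the sampling bounds, the grid size and the example sets are all assumptions, and the second degree is computed as the ratio stated in the text (area of the intersection over area of B), with min as the intersection.

```python
from typing import Callable

FuzzySet = Callable[[float], float]


def trapezoid(a: float, b: float, c: float, d: float) -> FuzzySet:
    """Membership function with support [a, d] and core [b, c]."""
    def mu(x: float) -> float:
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu


def match1(A: FuzzySet, B: FuzzySet, lo: float, hi: float, n: int = 10_000) -> float:
    """Height of the intersection: sup over the grid of min(A(x), B(x))."""
    step = (hi - lo) / n
    return max(min(A(lo + i * step), B(lo + i * step)) for i in range(n + 1))


def match2(A: FuzzySet, B: FuzzySet, lo: float, hi: float, n: int = 10_000) -> float:
    """Area of the intersection divided by the area of B (midpoint rule)."""
    step = (hi - lo) / n
    mids = [lo + (i + 0.5) * step for i in range(n)]
    inter = sum(min(A(x), B(x)) for x in mids) * step
    area_b = sum(B(x) for x in mids) * step
    return inter / area_b if area_b > 0.0 else 0.0


# Hypothetical observed set A and target description B over the area domain.
A = trapezoid(9.0, 9.2, 11.0, 11.5)
B = trapezoid(8.0, 9.0, 10.0, 11.0)
m1 = match1(A, B, 0.0, 20.0)
m2 = match2(A, B, 0.0, 20.0)
```

For these example sets the cores overlap, so the first degree is 1.0, while the second is about 0.7 because part of B is not covered by A.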

The degrees of matching defined by equations (1) and (2) behave in two 
essentially different ways. The degree match1 only depends on the existence of some common 
elements between A and B, while match2 compares how much of A is inside B 
with how much of A is outside B. The difference is graphically illustrated in 
Fig. 4. When the cores of A and B have no common points (left), both definitions 
provide a degree of matching smaller than 1. As soon as the cores intersect (middle 
and right), match1 always indicates total matching, while match2 gives us only a 
partial degree whenever A is not entirely contained in B. In a sense, definition 
(1) tells us how much the observed value may satisfy our criterion B, while 
definition (2) tells us how much the observed value must satisfy it. 

Fig. 4. Three examples of partial matching between a set A and a reference set B. 
Left: match1 = 0.7, match2 = 0.4. Middle: match1 = 1.0, match2 = 0.8. 
Right: match1 = 1.0, match2 = 1.0. 

Measure (2) is more discriminating, and it has provided superior empirical 
results in our domain. We have thus adopted this measure in our system. For 
computational reasons, however, we approximate (2) by the ratio between the 
area of the inner trapezoidal envelope of A and the area of B. When A and 
B are trapezoidal fuzzy sets, both areas can be easily computed from the four 
parameters of the trapezoids. 
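
As an illustration of why the trapezoidal case is cheap, the area of a trapezoidal fuzzy set with support [a, d] and core [b, c] is the mean of the two widths. The helper names below are hypothetical, and the construction of the inner trapezoidal envelope itself is not shown: the sketch simply assumes that its four parameters are already available.

```python
from typing import Tuple

Params = Tuple[float, float, float, float]  # (a, b, c, d)


def trapezoid_area(a: float, b: float, c: float, d: float) -> float:
    """Area under a trapezoid of height 1 with support [a, d] and core [b, c]."""
    return 0.5 * ((d - a) + (c - b))


def area_ratio(envelope: Params, target: Params) -> float:
    """Ratio between the area of an (assumed given) envelope and the area of B."""
    return trapezoid_area(*envelope) / trapezoid_area(*target)


# Hypothetical parameters, for illustration only.
ratio = area_ratio((9.0, 9.2, 10.0, 11.0), (8.0, 9.0, 10.0, 11.0))
```

With these example parameters the two areas are 1.4 and 2.0, so the ratio is about 0.7, obtained without any numerical integration.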



4.3 Fuzzy Matching of Several Features 

Once we have computed a degree of matching for each individual feature, we 
need to combine all these degrees together in order to obtain an overall degree 
of matching between a description and a given object perceived by the vision 
system. The simplest way to combine our degrees is by using a conjunctive 
type of combination, where we require that each one of the features matches the 
corresponding part in the description. Conjunctive combination is typically done 
in fuzzy set theory by T-norm operators [8,4], whose most used instances are 
min, product, and the Łukasiewicz T-norm max(x + y − 1, 0). In our experiments, 
we have noticed that the latter operator provides the best results. (See [2] for 
an overview of alternative operators.) 
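
The three T-norms mentioned above are easy to compare side by side. The following sketch is illustrative only; the function names and the example degrees are assumptions.

```python
from functools import reduce
from typing import Callable, Iterable


def t_min(x: float, y: float) -> float:
    return min(x, y)


def t_product(x: float, y: float) -> float:
    return x * y


def t_lukasiewicz(x: float, y: float) -> float:
    return max(x + y - 1.0, 0.0)


def combine(degrees: Iterable[float], t_norm: Callable[[float, float], float]) -> float:
    """Conjunctive combination of per-feature degrees; 1.0 is neutral for every T-norm."""
    return reduce(t_norm, degrees, 1.0)


# Hypothetical degrees of matching for three features.
degrees = [0.9, 0.8, 0.7]
by_min = combine(degrees, t_min)
by_product = combine(degrees, t_product)
by_lukasiewicz = combine(degrees, t_lukasiewicz)
```

For these degrees min gives 0.7, product about 0.504 and the Łukasiewicz T-norm about 0.4: the latter penalizes the accumulation of several partial mismatches most strongly.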

The overall degree of matching is used by the SIM to select the best anchor 
among the candidate objects provided by the vision module. For each candidate, 
the SIM first computes its degree of matching to the intended description; then 
it ranks these candidates by their degree, and returns the full ordered list to the 
reactive executor. Having a list of candidates is convenient if the currently best 
one later turns out not to be the one we wanted. Also, it is useful to know how 
much the best matching candidate is better than the other ones: if the two top 
candidates have similar degrees of matching, we may decide to engage in further 
exploratory actions in order to disambiguate the situation before committing to 
one of them; for instance, we may request the vision system to zoom in on each 
candidate in turn in the hope of getting more precise data. 
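
The ranking step and the check on the two top candidates can be sketched as below. The function names, the candidate identifiers, the degrees and in particular the `margin` threshold are hypothetical; the text does not specify how close two degrees must be to count as similar.

```python
from typing import Dict, List, Tuple


def rank_candidates(degrees: Dict[int, float]) -> List[Tuple[int, float]]:
    """Full list of (candidate ID, degree), best match first."""
    return sorted(degrees.items(), key=lambda item: item[1], reverse=True)


def needs_disambiguation(ranked: List[Tuple[int, float]], margin: float = 0.2) -> bool:
    """True when the two best candidates are closer than `margin` (assumed threshold)."""
    if len(ranked) < 2:
        return False
    return (ranked[0][1] - ranked[1][1]) < margin


# Illustrative overall degrees of matching for four candidates.
ranked = rank_candidates({1: 0.65, 2: 0.84, 3: 0.0, 4: 0.97})
ambiguous = needs_disambiguation(ranked)
```

Here candidate 4 is ranked first, but candidate 2 is only 0.13 behind, so under the assumed margin the situation would be flagged for further exploratory actions.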

While conjunctive T-norm combination has produced a satisfactory behavior 
in our preliminary experiments, there are a few reasons why more complex types 
of combinations seem more adequate to our case. First, some of the features 
are more critical than others, and we would like their degree of matching to 
have a stronger impact on the overall degree. Second, in some situations some 
values are known not to be reliable and should have little impact on the overall 
degree of matching: for instance, the observed size of the car is not reliable 
when the viewing angle is large. Finally, some features have errors which are 
strongly correlated (e.g., length and width) and it might be wise to combine 
their individual degrees of matching by an idempotent operator. The search for 
a more adequate combination technique is part of our current development. 



5 Fuzzy Signature Matching at Work 



We illustrate the use of the fuzzy signature matching by two examples on the 
scenario taken from the WiTAS project shown in Fig. 5. In this scenario, the 
deliberative system is interested in a red car of a specified model in the vicin- 
ity of a given crossing. Four cars are situated around that crossing, moving in 
different directions. The cars are all red, but of different models: a small van, a 
big Mercedes, a small Mercedes, and a Lotus. Discriminating between these cars 
is made more difficult by the fact that the helicopter views the crossing at an 
inclination of about 30 degrees: this results in some perspective distortions, thus 
introducing more uncertainty in the extraction of geometrical features. 




Fig. 5. The simulated scenario for our examples. 

In our first example, the deliberative system decides to follow ‘Van-B’, which 
is described as a red van. The SIM sends the prototypical signature of a red van 
to the vision module. Since all the four cars in the image are red, and they have 
fairly similar shapes, the vision module returns the observed signatures of all 
the four cars to the SIM. These signatures are then matched against the desired 
signature by our routines, resulting in the following degrees of matching: 



ID    Color   Shape   Overall
66    1.0     0.58    0.58
67    1.0     0.38    0.38
68    1.0     1.0     1.0
69    1.0     0.0     0.0



The ID is a label assigned by the vision system to each car found in the image. 
The degree of matching for the color is obtained by combining the individual 
degrees of hue, saturation, and value; in our case, this will be 1.0 for all the cars 
as they are all red. The degree of matching for the shape is the combination of 
the individual degrees of matching of length, width, and area. The overall degree 
is the Łukasiewicz combination of the color and shape degrees. In this case, car 
68 is correctly^ identified as the best candidate, and an anchor to that car is 
thus returned to the deliberation system. 

In the second example, the deliberative system is interested in ‘Car-D’, a red 
small Mercedes. The SIM sends the corresponding prototypical signature to the 
vision module, and again gets the signatures of all the four cars in the image 
as an answer. In this case however, the helicopter is at a long distance from 
the crossing and it views the crossing at an inclination of about 30 degrees. By 
applying our fuzzy signature matching routine, we obtain the following degrees: 



ID    Color   Shape   Overall
66    1.0     0.65    0.65
67    1.0     0.84    0.84
68    1.0     0.0     0.0
69    1.0     0.97    0.97



Cars 66, 67 and 69 match the desired description to some degree, while car 68 can 
safely be excluded. The SIM tries to improve the quality of the data by asking 
the vision module to zoom in on each one of cars 66, 67, and 69 in turn. Using 
the observed signatures after zooming, the SIM then obtains the new degrees of 
matching: 



ID    Color   Shape   Overall
66    1.0     0.30    0.30
67    1.0     0.70    0.70
69    1.0     0.21    0.21



^ This verification was done manually off-line. 

The closer view results in a smaller segmentation error, since the scale factor 
is smaller, and hence in narrower fuzzy sets. As a consequence, all the 
degrees of matching have decreased with respect to the previous observation. What 
matters here, however, is the relative magnitude of the degrees obtained from 
comparable observations, that is, those which are collected in the above table. 
The SIM sends the identifiers of each of the cars to the reactive executor together 
with their degrees of matching. These degrees allow the reactive executor to 
select car 67 as the best candidate. 
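
The selection in this second example can be replayed with the overall degrees reported in the two tables. The variable names are hypothetical and the snippet only reproduces the bookkeeping, not the matching itself.

```python
# Overall degrees of matching from the distant view and after zooming.
before_zoom = {66: 0.65, 67: 0.84, 68: 0.0, 69: 0.97}
after_zoom = {66: 0.30, 67: 0.70, 69: 0.21}

# Cars with a zero degree are excluded; the others are zoomed on in turn.
to_zoom = sorted(car for car, degree in before_zoom.items() if degree > 0.0)

# Best candidate according to each set of comparable observations.
best_before = max(before_zoom, key=before_zoom.get)
ranking_after = sorted(after_zoom, key=after_zoom.get, reverse=True)
best_after = ranking_after[0]
```

From the distant view car 69 has the highest degree; after zooming, the ranking becomes 67, 66, 69, which is why car 67 is selected.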

The reactive executor now has the option to try to further improve its choice 
by commanding the helicopter to fly over car 67 and take another measurement 
from above the car — the best observation conditions for the vision system. If 
we do this, we finally obtain a degree of matching of 1.00 for car 67. Note that 
this degree could as well have dropped, thus indicating that car 67 was not really 
the car that we wanted. In this case, the reactive executor could have requested 
the SIM to go back to cars 66 and 69 to get more accurate views. 

6 Conclusions 

Anchoring symbols to the physical objects that they are meant to denote requires 
the ability to integrate symbolic and numeric data under uncertainty. Although 
anchoring is rarely identified as a clearly separated process, we believe that this 
process must be present in any embedded symbolic system, including most of 
the current autonomous robots. In this paper, we have considered an instance of 
the anchoring problem in which we link the car identifiers used at the decision- 
making level to the perceptual data provided by a vision system. We have shown 
that explicitly representing and reasoning about the uncertainty in this problem 
improves the results of the anchoring process. 

Our experimental results show that our technique is adequate to handle the 
ambiguities that arise when integrating uncertain perceptual data and symbolic 
representations. In particular, the use of fuzzy signature matching definitely im- 
proves our ability to discriminate among perceptually similar objects in difficult 
situations (e.g., perspective distortion). Moreover, degrees of matching allow us 
to exclude unlikely candidates, and to rank the likely ones by their similarity 
to the intended description. Finally, degrees of matching can help in decision 
making; for example, when these degrees indicate a large amount of anchoring 
ambiguity, the system may decide to engage in active information gathering such 
as zooming and getting closer to the object in order to obtain better information. 
It should be noted that the fuzzy logic techniques that we use in our system are 
not novel. The main novelties of our work are the explicit use of the notion of 
anchoring to integrate symbols and sensor data, and the extension of this notion 
to take uncertainty into account. 

The work reported in this paper is still in progress, and many aspects need 
to be further developed. First, we need to study more sophisticated forms of 
aggregation of the individual degrees of matching of different features into an 
overall degree. Second, we plan to include features of a different nature into the 
matching process, like the observed position and velocity of the cars. Finally, 
until now we have only performed experiments in simulation. At the current 
stage of development of the WiTAS project, the vision system takes as input the 
video frames produced by a 3D simulator. Although this configuration results in 
some amount of noise and uncertainty in the extracted features, we are aware 
that a real validation of our technique will only be possible when we have access 
to real data from an on-board camera. 

Abstract. Recent extensions of classical belief change processes tend to 
get closer to numerical estimation techniques. A thorough investigation 
is proposed, in which Kalman filtering is compared with classical AGM 
revision and KM update and with more recent approaches. The aim is to 
identify common aspects and differences, and to highlight the progressive 
evolution of belief change processes towards what would be a symbolic 
transposition of numerical estimation tools. 



1 Introduction 

Numerical update used in control theory has received little attention in the 
literature dedicated to belief change. Only Boutilier, Friedman and Halpern [3, 
4] noticed that some proposed models of belief change have common aspects 
with approaches developed within the context of stochastic dynamic systems, 
and moreover, Dubois and Prade [11] announced that the debate between update 
and filtering was open. 

In [6], a first approach to a symbolic framework resulting from a transposition of 
Kalman filtering was proposed to deal with uncertainty in situation assessment. 
It appears that such an approach is inspired both by numerical estimation tools 
used in control theory and by belief change tools. As a contribution to the debate, 
the aim of the present paper is therefore to investigate both tools and compare 
the concepts and processes that are involved. 

Kalman filtering [15, 16] is a numerical estimation technique that is widely 
used for tracking and control. It allows the state of a dynamic system subject 
to deterministic and random inputs to be estimated from observations spoilt 
with stochastic errors, thanks both to a state evolution model and an observa- 
tion model including random noises. In a multimodel context, the global state 
estimate results from a weighted sum of local state estimates that are calculated 
separately by different Kalman filters associated with each possible model [19, 
2, 22]. 

Within the logical framework, belief change consists in characterizing the 
evolution of a belief set when new information is delivered. Two types of belief 
change operations are classically distinguished: revision and update. Revision 
characterizes a belief change resulting from new information relative to a static 
world, whereas update results from a change occurring in a dynamic world. 
In order to characterize any rational revision operator, Alchourrón, Gärdenfors 
and Makinson (AGM) [1] proposed a set of postulates that have become one of 
the standard frameworks for belief change. Katsuno and Mendelzon (KM) [17] 
then provided an axiomatization, i.e. a set of postulates, for any rational update 
operator. Many extensions, critiques and refinements of the theories of AGM or 
KM have been proposed over the last ten years. 

Despite the fact that Kalman filtering and belief change techniques have 
been studied within two scientific communities that have a priori no connec- 
tion with one another, a thorough investigation has to be made now in order 
to identify what they have in common and how they differ from one another. 
Therefore, mono- and multi- model Kalman filtering will be compared to revi- 
sion and update postulates and to their extensions. This comparison is the first 
step towards the design of a symbolic estimation tool for real-world applications. 

N.B.: (i) The reader is assumed to be more familiar with belief change tools than 
with Kalman filtering; therefore more background will be recalled on the latter. 

(ii) Let $\mathcal{L}$ be a propositional logic language built on a finite set of propositional 
variables. Formulas are denoted $\psi$, $\varphi$, $\mu$... A belief set $K$ is represented by a 
propositional formula $\psi$, such that $K$ is the deductive closure of $\psi$. $\Omega$ denotes the 
set of all the interpretations. 

(iii) Because the word “model” is polysemous in this paper, we will call model 
a representation of the behaviour of a dynamic system, and $\mathcal{L}$-models the 
interpretations that make a logic theory true in the usual sense. 

2 Kalman Filtering vs Classical Revision and Update 

2.1 Kalman Filtering: Recalls 

N.B.: only the basic discrete-time Kalman estimator is considered, details about 
the continuous-time Kalman estimator and extensions can be found in [15]. 

Monomodel discrete-time Kalman filtering (KF) rests on the assumption of a 
system represented by a state vector $x_k$ at time $k$, whose dynamics is described 
by a state evolution model $x_{k+1} = F_k x_k + u_k + v_k$, with $F_k$ a deterministic matrix, 
$u_k$ the input of the system and $v_k$ a Gaussian noise (state noise). $x_k$ is observed 
through measure $z_k$ thanks to an observation model $z_k = H_k x_k + w_k$, with $H_k$ 
denoting a deterministic matrix and $w_k$ a Gaussian noise (observation noise). 
$v_k$ and $w_k$ are zero-mean, white random sequences with covariances $Q_k$ and 
$R_k$ respectively. The aim is to compute the state estimate $\hat{x}_{k|k}$ at time $k$ from 
measures $z_0, \ldots, z_k$, associated with an estimation error covariance matrix $P_{k|k}$, 
that minimizes the mean-squared error $E[(x_k - \hat{x}_{k|k})^T (x_k - \hat{x}_{k|k})]$. 
This is achieved through a two-step process. The first step consists in predicting 
both the next state estimate $\hat{x}_{k+1|k}$ from the current state estimate $\hat{x}_{k|k}$ thanks 
to the state evolution model, and the next observation $\hat{z}_{k+1|k}$ thanks to the 
observation model, noises being assumed equal to zero: $\hat{x}_{k+1|k} = F_k \hat{x}_{k|k} + u_k$ and 
$\hat{z}_{k+1|k} = H_{k+1} \hat{x}_{k+1|k}$. The state prediction covariance is then 
$P_{k+1|k} = F_k P_{k|k} F_k^T + Q_k$. The second step consists in comparing the actual 
observation $z_{k+1}$ at $k+1$ with prediction $\hat{z}_{k+1|k}$, and in correcting predicted 
state $\hat{x}_{k+1|k}$ accordingly. The correction depends on error $(z_{k+1} - \hat{z}_{k+1|k})$ called 
innovation, on state prediction error $P_{k+1|k}$ and on the noises through the gain 
matrix $K_{k+1}$: 

\[
\begin{aligned}
\hat{x}_{k+1|k+1} &= \hat{x}_{k+1|k} + K_{k+1}\,(z_{k+1} - \hat{z}_{k+1|k}) \\
K_{k+1} &= P_{k+1|k} H_{k+1}^T \,(H_{k+1} P_{k+1|k} H_{k+1}^T + R_{k+1})^{-1} \\
P_{k+1|k+1} &= P_{k+1|k} - K_{k+1} H_{k+1} P_{k+1|k}
\end{aligned}
\]

In fact, new state estimate $\hat{x}_{k+1|k+1}$ intuitively corresponds to a weighted sum 
of predicted state $\hat{x}_{k+1|k}$ and the state computed from actual observation 
$z_{k+1}$ (assuming there is no observation noise), such that weights are inversely 
proportional to the dispersion of the variables (see Fig. 1). 
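
For a one-dimensional state the two steps reduce to a few scalar operations, which makes the weighted-sum reading easy to check. The sketch below is a scalar illustration under that assumption (all matrices become plain numbers); the function and parameter names are hypothetical.

```python
from typing import Tuple


def kf_predict(x_est: float, p_est: float, f: float, u: float, q: float) -> Tuple[float, float]:
    """Prediction step: propagate the estimate and its variance through the model."""
    x_pred = f * x_est + u
    p_pred = f * p_est * f + q
    return x_pred, p_pred


def kf_correct(x_pred: float, p_pred: float, z: float, h: float, r: float) -> Tuple[float, float]:
    """Correction step: weigh the innovation by the gain k."""
    z_pred = h * x_pred                        # predicted observation
    k = p_pred * h / (h * p_pred * h + r)      # scalar gain
    x_new = x_pred + k * (z - z_pred)
    p_new = p_pred - k * h * p_pred
    return x_new, p_new


# Illustrative values: static model (f = 1), direct observation (h = 1),
# prediction and observation equally uncertain (p = r = 1).
x_pred, p_pred = kf_predict(x_est=0.0, p_est=1.0, f=1.0, u=0.0, q=0.0)
x_new, p_new = kf_correct(x_pred, p_pred, z=2.0, h=1.0, r=1.0)
```

With equal variances the gain is 0.5, so the new estimate 1.0 lies halfway between the prediction 0.0 and the observation 2.0, and the variance is halved to 0.5.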

In a multimodel context, the global state estimate results from a weighted 
sum of local state estimates that are calculated separately by different Kalman 
filters associated with each possible state evolution model (see Fig. 1). Bayesian 
methods [19] aim at identifying the model of a system, which is assumed to be 
constant, among a set of possible models, whereas hybrid markovian methods 
[2, 22] deal with changing models, and therefore with sequences of models. 
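
The combination step of the multimodel case can be sketched as a normalized weighted sum. How the weights are obtained (for instance as model probabilities) is not detailed here, so the weights below are simply assumed to be given; the function name and numbers are hypothetical.

```python
from typing import Sequence


def global_estimate(local_estimates: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the local state estimates, with weights normalized to 1."""
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, local_estimates)) / total


# Two local (scalar) estimates from two filters, with assumed weights.
x_global = global_estimate([1.0, 3.0], [0.75, 0.25])
```

With these values the global estimate is 1.5, closer to the estimate of the filter whose model carries the larger weight.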





Fig. 1. Mono- and multi-model Kalman filtering (KF: Kalman filtering). In the 
multimodel case, the new global state estimate is a weighted sum of the new 
state estimates. 



2.2 Comparison on Principles 
• Estimates vs beliefs, measures vs information 

Just as state estimate $\hat{x}_{k+1|k+1}$ is a (numerical) representation of the world
computed from both previous state estimate $\hat{x}_{k|k}$ and measure $z_{k+1}$, the belief
set $\varphi$ resulting from revision or update is a (symbolic) representation of the world
computed from both initial belief set $\psi$ and information $\mu$, denoted as $\psi \circ \mu$ or
$\psi \diamond \mu$ respectively. Belief set $\varphi$ and state estimate $\hat{x}_{k+1|k+1}$ have therefore the
same status.

This is not the same for measures and information. As far as Kalman filtering 
is concerned, state $x_{k+1}$ and measure $z_{k+1}$ belong to two different spaces and an
observation model is needed. In contrast, new information $\mu$ can be integrated
as it is into belief set $\varphi$ when considering classical revision or update: the world
is assumed to be directly observable, no observation model is needed.

Remark : Kalman filtering and belief change processes only aim at estimating a 
numerical or symbolic representation of the world. They are not decision pro- 
cesses such as POMDPs [5], even if delivered estimations can be used as inputs 
for such processes. 

• Monomodel Kalman filtering vs classical revision 

The basic assumptions of Kalman filtering and revision are not the same: the 
former assumes a dynamic world, whereas the latter assumes a static world. 
However, the second step of Kalman filtering is similar to a revision process: 
thanks to new measures, the predicted state is corrected. 

Remark : Kalman filtering could be compared to revision in a very particular
case, namely when $F_k$ is the identity matrix, and $u_k = v_k = 0$: the system
does not evolve ($x_k = x$ does not depend on $k$) and the successive observations
$z_k$ enable the knowledge about $x$ to be refined. Nevertheless, the Kalman filtering
process is useless in this trivial case.
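
The particular case of this remark can be checked numerically. In the sketch below, which assumes a scalar state observed directly, an identity evolution model and no state noise, the filter simply refines a constant: with a very vague prior it ends up close to the plain average of the observations. The name `static_kalman` and the values are hypothetical.

```python
def static_kalman(observations, x0=0.0, p0=1e6, r=1.0):
    """Scalar filter with identity evolution, no control and no state noise."""
    x, p = x0, p0
    for z in observations:
        # The prediction step leaves (x, p) unchanged, so only correction remains.
        gain = p / (p + r)
        x = x + gain * (z - x)
        p = p - gain * p
    return x, p


observations = [1.0, 2.0, 3.0, 6.0]
estimate, variance = static_kalman(observations)
sample_mean = sum(observations) / len(observations)
```

The final variance is close to `r / len(observations)`, the variance of the average of independent measurements.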

• Monomodel Kalman filtering vs classical update

In contrast, Kalman filtering and update are both relative to the evolution of the 
world. However, the evolution principles are not similar. Kalman filtering relies 
on the notions of anticipation and prediction: state evolves, even if no state 
observation is delivered. Conversely, classical update is based on inertia and 
expectation: the belief set is likely to evolve only if new information is delivered. 

• Multimodel Kalman filtering vs classical revision and update 

For a more relevant comparison, let us consider the semantic characterizations of 
revision and update proposed by Katsuno and Mendelzon [17]: revision consists
in selecting, in a global way, the $\mathcal{L}$-models of new information $\mu$ that are the
closest to the $\mathcal{L}$-models of initial belief set $\psi$; update consists in selecting, for
each $\mathcal{L}$-model of $\psi$, the set of the closest $\mathcal{L}$-models of $\mu$. The local behaviour
of update, in contrast to the global behaviour of revision, is one of the main 
differences between both operators (see Fig. 2). 
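
The global versus local selection can be illustrated on propositional models encoded as tuples of truth values. The sketch assumes Hamming distance as the notion of closeness, which is one possible choice and is not prescribed by the text; the function names and the example are hypothetical.

```python
def hamming(w1, w2):
    """Number of atoms on which two models disagree."""
    return sum(a != b for a, b in zip(w1, w2))


def revise(psi_models, mu_models):
    """Global selection: models of mu closest to the set of models of psi."""
    dist = {m: min(hamming(m, w) for w in psi_models) for m in mu_models}
    best = min(dist.values())
    return {m for m, d in dist.items() if d == best}


def update(psi_models, mu_models):
    """Local selection: for each model of psi, its own closest models of mu."""
    result = set()
    for w in psi_models:
        best = min(hamming(w, m) for m in mu_models)
        result |= {m for m in mu_models if hamming(w, m) == best}
    return result


# Atoms (a, b): psi says exactly one of a, b holds; mu says a holds.
psi = {(1, 0), (0, 1)}
mu = {(1, 0), (1, 1)}
revised = revise(psi, mu)
updated = update(psi, mu)
```

Revision keeps only `(1, 0)`, which already satisfies both formulas, while update also keeps `(1, 1)` because the model `(0, 1)` evolves on its own towards its closest model of the new information.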




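[Figure: $\mathcal{L}$-models of $\psi$ and $\mathcal{L}$-models of $\mu$; revision yields the set of $\mathcal{L}$-models of $\psi \circ \mu$ through a global evolution, update yields the set of $\mathcal{L}$-models of $\psi \diamond \mu$ through local evolutions.]

Fig. 2. Revision and update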



As far as multimodel Kalman filtering is concerned, the local state estimates 
that (numerically) compose the global state estimate evolve apart from each 
other. Therefore, classical update and multimodel Kalman filtering are both
characterized by independent evolutions of $\mathcal{L}$-models/local state estimates that
compose the updated belief set/global state estimate. 

• Markovian assumption 

The state evolution model in Kalman filtering is markovian, as state $x_{k+1}$ only
depends on state $x_k$ and parameters at time $k$. However, Kalman filtering itself is
non-markovian w.r.t. state estimation, as the correction step at $k+1$ involves
parameters ($P_{k+1|k}$, and consequently $K_{k+1}$) that depend on the whole history of
the process, from 0 to $k$. In contrast, classical revision and update are markovian
w.r.t. belief set estimation, since both processes only depend on belief set $\psi$ and
on new information $\mu$.

Remark : nevertheless, the estimation of the pair $(\hat{x}_{k+1|k+1}, P_{k+1|k+1})$ by Kalman
filtering is markovian because of the iterative structure of the equations.

A more detailed comparison involving AGM revision postulates rephrased by 
Katsuno and Mendelzon, and KM update postulates [17] is now proposed. Due 
to space limitation, only the first three postulates are dealt with in this paper. 



2.3 Kalman Filtering vs Classical Postulates 
• R1, U1

The success postulate, which characterizes both revision - $\psi \circ \mu$ implies $\mu$ (R1) -
and update - $\psi \diamond \mu$ implies $\mu$ (U1) - relies on the assumption that new information
$\mu$ is perfect. Therefore $\mu$ is always entirely integrated into belief set $\psi$. For
Kalman filtering on the contrary, noises ($v_k$, $w_k$) are taken into account within
both the state evolution and the observation models, thus resulting in a weighted
integration of new measures (see Fig. 3): there is no a priori preponderance 
of observation over prediction and aberrant measures may be rejected through 
a validation process performed before the correction step and consisting in a 
statistical test based on the computation of the covariance of the innovation 
[21]. Only measures that lie within a certain predetermined error bound are 
accepted. 
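
The validation process can be sketched as a gate on the normalized squared innovation. The text only states that the test is based on the covariance of the innovation; the chi-square threshold used below (95% quantile with one degree of freedom) and the name `gated_step` are assumptions made for the illustration, for a scalar state.

```python
CHI2_95_1DOF = 3.841  # 95% quantile of the chi-square law with 1 degree of freedom


def gated_step(x_est, p_est, z, f=1.0, q=0.0, h=1.0, r=1.0, gate=CHI2_95_1DOF):
    """Predict, then correct only if the measure passes the validation gate."""
    x_pred = f * x_est
    p_pred = f * p_est * f + q
    innovation = z - h * x_pred
    s = h * p_pred * h + r  # variance of the innovation
    if innovation * innovation / s > gate:
        # Aberrant measure: the cycle is reduced to the prediction step.
        return x_pred, p_pred, False
    gain = p_pred * h / s
    return x_pred + gain * innovation, p_pred - gain * h * p_pred, True


accepted_case = gated_step(0.0, 1.0, 1.0)   # plausible measure
rejected_case = gated_step(0.0, 1.0, 10.0)  # aberrant measure
```

A rejected measure leaves the predicted pair untouched, so the measure is not integrated at all, in contrast with the success postulate.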



[Figure: initial belief set $\psi$ and measure $\mu$ under revision (revised belief set), update (updated belief set) and Kalman filtering.]

Fig. 3. Success postulate

Remark : Friedman and Halpern [13] have underlined the fact that the justification
of the success postulate depends on the ontology adopted in the character-
ization of the studied process. Recently, in collaboration with Boutilier [4], they 
have proposed an approach in which the success postulate is dropped. 

• R2, U2 

The second postulate is a principle of inertia or minimal change. 

Revision postulate (R2) - if $\psi \wedge \mu$ is satisfiable, then $\psi \circ \mu \leftrightarrow \psi \wedge \mu$ -
states that if new information $\mu$ is consistent with initial belief set $\psi$, then
revision simply amounts to adding $\mu$ to $\psi$. The equivalent notion of consistency
in the numerical field is the fact that measure $z_{k+1}$ is accepted through the
validation process. If it is not, Kalman filtering is still possible, but reduced to 
the prediction step. 

Update postulate (U2) - if $\psi$ implies $\mu$, then $\psi \diamond \mu$ is equivalent to $\psi$ -
states that if new information $\mu$ is a consequence of initial belief set $\psi$, then $\psi$
does not evolve. In particular, no update can make $\psi$ consistent if it is initially
inconsistent. In contrast, as Kalman filtering involves a state evolution model, 
state varies with time according to this model even if there is no new measure 
(see Fig. 4). 



[Figure: inertia under (U2) compared with Kalman filtering.]

Fig. 4. Principle of inertia





Furthermore, according to the different noises, the weight of the measure can 
be preponderant in the correction step and the filter converges even in case of 
state initialization errors. Nevertheless, the fact that the filtering process itself 
is non-markovian results in an inertia beside state evolution: the process may 
react slowly, or even diverge (e.g. multimodel bayesian methods) if state evolves 
too sharply. 

Remark : (U2) is not always justified in case of evolutive systems. Therefore, 
Dubois, Dupin de Saint-Cyr and Prade [10] have proposed a less restrictive 
formalization of update operators in which (U2) is dropped. The set of postu- 
lates they propose can be applied to approaches based on an explicit structure 
of transition (see for example paragraph 4.1), in which the inertia principle is 
rejected by definition. 

• R3, U3 

Revision postulate (R3) - if $\mu$ is satisfiable, then $\psi \circ \mu$ is also satisfiable - states
that revision is always possible if new information $\mu$ is satisfiable. This notion
cannot be transposed into the numerical context as measure $z_{k+1}$ cannot be
validated without referring to models and previous estimations.

Update postulate (U3) - if both $\psi$ and $\mu$ are satisfiable, then $\psi \diamond \mu$ is also
satisfiable - states that update is always possible if belief set $\psi$ and new information
$\mu$ are both satisfiable. This supposes that transitions exist between all possible
consistent belief sets. In contrast, Kalman filtering is always possible between $k$
and $k+1$, even if either estimation $\hat{x}_{k|k}$ at $k$ is bad (i.e. error covariance $P_{k|k}$ is
high) or measure $z_{k+1}$ at $k+1$ is aberrant. Nevertheless, as Kalman filtering is
based on a state evolution model, a given state is not reachable from any state.

Remark : in the update formalization proposed in [10], (U3) is modified in order
to be less restrictive. Some transitions between belief sets are no longer possible.

Classical revision and update are one-step, non-iterated distinct processes
that deal with perfect observations. Some extensions have been proposed over the
last ten years, where limitations of AGM or KM theories have been highlighted.
These approaches tend to get closer to numerical estimation, without necessarily
referring to it. Some of them are studied in the sequel.

3 Kalman Filtering vs Iterated Revision 

Most revision operators satisfying AGM postulates suppose an underlying
preference relation, denoted as $\mathcal{R}$ in the sequel, at the formula [1] or at the
$\mathcal{L}$-model level [17]. However, the preference relation is lost during the revision
process. Consequently, several authors, e.g. [18, 9, 20], have stated that the AGM
theory of revision does not support iterated revision (IR): in order to deal with
successive observations, not only beliefs have to be revised, but also the revision
strategy encoded by $\mathcal{R}$. The proposed frameworks for iterated revision then rely
on a common principle of revision achievement at the epistemic state level, where
epistemic states include both belief sets and revision strategy. They consist in
revising the underlying preference relation $\mathcal{R}$ when new information $\mu$ is delivered.
The revised preference relation determines a new partial order on formulas
or possible worlds: the revised belief set $\varphi \equiv \psi \circ \mu$ is deduced from the
most preferred formulas or $\mathcal{L}$-models of $\mu$, and less preferred formulas or possible
worlds represent the dispersion characterizing the uncertainty associated with
the state of the world (see Fig. 5).

As for Kalman filtering, the state is assumed to be a gaussian variable:
the best state estimate $\hat{x}_{i|j}$ is chosen as the most probable value,
i.e. the mean of the gaussian distribution, and covariance $P_{i|j}$ characterizes the
dispersion of possible values around it. Such a gaussian distribution $\mathcal{N}(\hat{x}_{i|j}, P_{i|j})$
corresponds to a numerical preference relation: the most preferred value is the
best state estimate $\hat{x}_{i|j}$, and $P_{i|j}$ characterizes possible but less preferred values.

Within the correction step, new measure $z_{k+1}$ allows the predicted state gaussian
distribution $\mathcal{N}(\hat{x}_{k+1|k}, P_{k+1|k})$ to be corrected. Consequently, the correction
step corresponds to the numerical revision of the numerical preference relation
encoded by $\mathcal{N}(\hat{x}_{k+1|k}, P_{k+1|k})$: state estimate $\hat{x}_{k+1|k+1}$ results from the corrected
gaussian distribution, just as revised belief set $\psi \circ \mu$ results from the revised
preference relation.

Fig. 5. Kalman filtering and iterated revision



4 Kalman Filtering vs two-step Approaches 

This section is dedicated to the analysis of two approaches of revision or update 
characterized by a two-step principle that tend to be the closest to the principle 
of Kalman filtering. 



4.1 Transition-Based Update 

Cordier and Lang [7] underline the relation existing between transition-based 
update, as defined by Cordier and Siegel [8], and belief revision. In transition- 
based update, inertia (U2) is not necessarily taken for granted and an explicit 
transition model TR (at the formula level) describes the possible evolutions 
of the world: “sure” transitions, that have to be necessarily satisfied when the 
world changes, characterize a deterministic evolution of the world; “expected” 
transitions characterize the uncertainty associated with this evolution, and may 
be violated. Transition-based update then consists in calculating, for each
$\mathcal{L}$-model of initial belief set $\psi$, the set of $\mathcal{L}$-models of new information $\mu$ that
minimize “abnormal” changes, i.e. such that sure transitions are satisfied and
violated transitions are the least expected ones.

Furthermore, if TR is reduced to “expected” transitions, Cordier and Lang 
have shown that update on a complete belief set is a two-step process (TBU* 
in the sequel): (i) prediction of beliefs from both the currently known facts (i.e. 
the complete belief set $\psi$) and TR, and (ii) revision of predicted beliefs by new
information $\mu$, which corresponds to a syntax-based inconsistency handling.

These two steps, prediction and revision, can be compared with the prediction
and correction steps of Kalman filtering. However, observation $\mu$ is still
entirely integrated and the truth of initial beliefs $\psi$ cannot be questioned:
possible inconsistency after the first step is only linked to transition model TR. In
contrast, for Kalman filtering, prediction error is linked both to a possible
dispersion around the deterministic state evolution and to a possible state estimation
error at previous times. Furthermore, TR is stationary, which is not always the
case for the state evolution model of Kalman filtering.



4.2 Generalized Update 

Boutilier [3] recently proposed an update process, called generalized update (GU), 
that combines aspects of both revision and update and can deal with successive 
and noisy observations. The key idea is that an observation can reflect the
evolution of the world but also question previous beliefs: postulate U3 is relaxed in the
sense that, if information $\mu$ is not reachable through any “reasonable” transition
from an $\mathcal{L}$-model of $\psi$, some initial beliefs of $\psi$ may be wrong.

The process assumes the existence of a description of the current state of the
world based on a quantitative ranking $\kappa$ on U associated with a notion of
plausibility, and a description of the possible occurrence of events and their effects. In
such a context, GU is a two-step process: (i) the prediction of the ranking on U,
reflecting the possible temporal evolutions of the world due
to event occurrences, is computed from the initial ranking $\kappa$; (ii) the revision of
the predicted ranking by new information $\mu$ results in ranking $\kappa^*$. $\mathcal{L}$-models of the
updated belief set $\psi \diamond \mu$ are those ranked 0 by $\kappa^*$ (see Fig. 6). In case of noisy
observations, an observation model (describing the plausibility of various observations
in different world states) is taken into account within the ranking revision step. The main
differences with classical approaches are that $\mathcal{L}$-models of $\psi \diamond \mu$ are not
necessarily $\mathcal{L}$-models of $\mu$ and are not necessarily reachable from the initial belief set
$\psi$, which consequently may be questioned.

Fig. 6. Generalized update with noisy observations
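
The two-step structure of generalized update can be sketched on a finite set of worlds. The sketch is not Boutilier's exact formalization: it assumes that ranks combine by addition and minimization (as in Spohn-style ranking functions), that the observation is noise-free, and that transitions are given directly as ranks rather than through events. All names and values are hypothetical.

```python
INF = float("inf")


def predict_ranking(kappa, transition_rank):
    """Step (i): rank of each world after one evolution step (min-plus rule)."""
    worlds = list(kappa)
    return {w2: min(kappa[w1] + transition_rank.get((w1, w2), INF) for w1 in worlds)
            for w2 in worlds}


def condition(kappa, mu_worlds):
    """Step (ii): revise the predicted ranking by a noise-free observation."""
    base = min(kappa[w] for w in mu_worlds)
    if base == INF:
        raise ValueError("observation is not reachable under this ranking")
    return {w: (kappa[w] - base if w in mu_worlds else INF) for w in kappa}


def belief_set(kappa):
    """Worlds ranked 0, i.e. the most plausible ones."""
    return {w for w, rank in kappa.items() if rank == 0}


# Hypothetical example: the world is initially believed to be dry.
kappa0 = {"dry": 0, "wet": 2}
transitions = {("dry", "dry"): 0, ("dry", "wet"): 1,
               ("wet", "wet"): 0, ("wet", "dry"): 1}
predicted = predict_ranking(kappa0, transitions)
revised = condition(predicted, {"wet"})
```

The prediction keeps "dry" as the most plausible world; observing "wet" then shifts the ranking so that the belief set becomes `{"wet"}`.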



Boutilier draws parallels between generalized update and bayesian update in
stochastic dynamic systems and states that “GU can be viewed as a qualitative
form of bayesian update, with the $\kappa$-calculus playing the role of probabilistic
laws”. Since the equations of Kalman filtering can be proved thanks to Bayes rule,
Kalman filtering and GU have many things in common: both assume an evolution
model and an observation model, both are based on a prediction-correction principle,
both are non-markovian w.r.t. state or belief set estimation but markovian w.r.t.
the gaussian distribution associated with state or the ranking estimation, both
refute the success postulate. However, the evolution and observation models are
stationary in the GU framework, and models cannot be questioned: events are
well-defined and no unexpected event can occur. As for Kalman filtering, models
can be non-stationary and ill-defined and state noise is taken into account in
order to model a random dispersion around the deterministic evolution.

Let us notice that the belief change model proposed by Friedman and Halpern 
[14] is close to Boutilier’s, when stationarity assumption (the system dynamics 
and observation model do not change over time) and Markovian assumption 
(dynamics are independent of history) are set. 



5 Summary and Further Research 



The main aspects of the comparison of Kalman filtering (KF) with AGM re- 
vision and KM update, iterated revision (IR), transition-based update (TBU*) 
and generalized update (GU) are summarized in the table (see Fig. 7). It can 
be noticed that one-step revision and update processes have evolved towards 
a unique two-step iterative model-based process combining both aspects, that 
tends to get closer to numerical estimation techniques such as Kalman filter- 
ing. It is based on the evolution of a preference relation that allows a “symbolic 
weighting” of previous beliefs and new observation to be achieved: previous be- 
liefs may be revised, and observations (that may be noisy) may not belong to 
the resulting belief set. 

As far as real-world symbolic estimation problems are concerned - e.g. situ- 
ation assessment for surveillance or decision making - the features that are re- 
quired come both from GU and KF: 



— models have to be non-stationary and have to include both deterministic and 
random components (as perfect models of what is likely to happen cannot 
be designed); 

— in a multimodel context, in case of aberrant observations, i.e. observations 
that cannot be matched with the current beliefs and models, the creation of 
new belief sets and models has to be allowed (as unexpected but mission- 
relevant things may happen in the world); 

— the preference relation has also to evolve independently of event occurrences 
and observations, so as to encode the notion of belief erosion [12] (as things 
in the world change even if no sensor watches them) . 



Further research is dedicated to the formalization of these requirements. 

Success postulate: Info entirely integrated | Noisy observations | Info may not belong to $\psi \diamond \mu$ | Weighted integration of measure

Markovian assumption: Markovian (no iteration) | Non markovian | Markovian (no iteration) | Non markovian process w.r.t. belief set or state estimation; markovian process w.r.t. the preference relation or the gaussian distribution estimation

Status of initial beliefs: R: inconsistency due to initial beliefs; U: initial beliefs never questioned | Possible revision of previous beliefs | Weighted integration of previous estimations

Temporal behaviour: R: static world; U: changing world but inertia, expectation | Observation-independent temporal prediction, then correction thanks to observation

Fig. 7. Summary

Abstract. This paper investigates Walley’s concepts of irrelevance and
independence, as applied to the theory of closed convex sets of probability 
measures. Walley’s concepts are analyzed from the perspective of axioms 
for conditional independence (the so-called semi-graphoid axioms). Two 
new results are demonstrated in discrete models: first, Walley’s concept
of irrelevance is an asymmetric semi-graphoid; second, Walley’s concept 
of independence is an incomplete semi-graphoid. These results are the 
basis for an understanding of irrelevance and independence in connection 
to the theory of closed convex sets of probability measures, a theory that 
has received attention as a powerful representation for uncertainty in 
beliefs and preferences. 



1 Introduction 

The purpose of this paper is to investigate technical properties of Walley’s con- 
cepts of irrelevance and independence. These concepts, connected to the theory 
of closed convex sets of probability measures, have several intuitive properties, 
but have not received a complete treatment in the literature. 

Closed convex sets of probability have been used to represent imprecision in 
probability models or to represent divergences in group decision-making [10, 12]; 
currently, the main practical application of these models is the field of robust
Bayesian Statistics [2,8,14]. Under various names, such as imprecise probabil- 
ity theory [12] or Quasi-Bayesian theory [7], closed convex sets of probability 
measures have received extensive technical analysis. In particular, several con- 
cepts of independence in Quasi-Bayesian models have emerged [3,6]. Recently, 
intuitive definitions of irrelevance and independence were proposed by Walley
[12, Chapter 9], definitions that can be explained in a direct manner from basic
axioms of Quasi-Bayesian theory. 

The purpose of this paper is to investigate Walley’s concepts of irrelevance 
and independence from the perspective of the so-called semi-graphoid axioms. 
These axioms are meant to capture the subtle idea behind “independence.” In 
fact, many important concepts of independence, like probabilistic independence 
and graphical independence, are semi-graphoids. 

Two new results are demonstrated in this paper in connection to discrete
Quasi-Bayesian models: first, Walley’s irrelevance defines an asymmetric
semi-graphoid; second, Walley’s independence defines an incomplete semi-graphoid.

A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 128-136, 1999. 
These results are the basis for a thorough understanding of irrelevance and
independence concepts in the theory of closed convex sets of probability measures.

2 Semi-graphoid Axioms 

Take random variables X, Y, Z and W on a probability space. Random
variables considered in this paper are discrete and functions of random variables
are bounded (values of all functions are finite). In other words, a random
variable X induces a countable partition on an underlying outcome space, and any
function f(X) considered in this paper is a finite constant on each element of
the partition induced by X.

As summarized by Dawid [4], a ternary relation $X \perp\!\!\!\perp Y \mid Z$ is a semi-graphoid
when it captures the notion that “once the value of Z has been specified, any
information about Y is irrelevant to uncertainty about X.” The semi-graphoid
axioms are postulates that capture the essential properties of this abstract no- 
tion. Note that Dawid ignores possible differences between concepts of irrelevance 
and independence. 

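The axioms for $\perp\!\!\!\perp$ are:

A1 If $X \perp\!\!\!\perp Y \mid Z$ then $Y \perp\!\!\!\perp X \mid Z$.

A2 $X \perp\!\!\!\perp Y \mid X$.

A3 If $X \perp\!\!\!\perp Y \mid Z$ and $W = g(Y)$ (a bounded function of Y), then $X \perp\!\!\!\perp W \mid Z$.

A4 If $X \perp\!\!\!\perp Y \mid Z$ and $W = g(Y)$ (a bounded function of Y), then $X \perp\!\!\!\perp Y \mid (W, Z)$.

A5 If $X \perp\!\!\!\perp Y \mid Z$ and $X \perp\!\!\!\perp W \mid (Y, Z)$, then $X \perp\!\!\!\perp (Y, W) \mid Z$.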

Call a structure that satisfies only A2, A3, A4 and A5 an asymmetric
semi-graphoid. Call a structure that satisfies only A1, A2, A3 and A4 an incomplete
semi-graphoid.

3 Quasi-Bayesian Theory 

A closed convex set of probability measures is called a credal set; existence of
credal sets is derived from axioms about preferences [7]. Convexity of a set K
means that, given any two measures $P_1$ and $P_2$ in K, the convex combination
$\alpha P_1 + (1 - \alpha) P_2$ (for $\alpha \in [0, 1]$) is also in K. To simplify terminology, sets of
probability densities (defined whenever possible) are also called credal sets. A
credal set defined by a set of densities p(X) is indicated by K(X).

Given a random variable X, lower and upper expectations for a function
f(X) are respectively defined as:

$$\underline{E}[f(X)] = \min_{p(X) \in K(X)} E_{p(X)}[f(X)], \qquad \overline{E}[f(X)] = \max_{p(X) \in K(X)} E_{p(X)}[f(X)],$$

where $E_{p(X)}[f(X)]$ is the standard expectation of the function f(X). There is
a one-to-one correspondence between lower (or upper) expectations and credal 
sets: a collection of lower (or upper) expectations for all arbitrary bounded func- 
tions f(X) defines a credal set K(X). Lower expectations can be obtained from 
upper expectations through the expression $\underline{E}[f(X)] = -\overline{E}[-f(X)]$. Bounds
on the probability of any event A can be obtained by taking the lower
expectation of the indicator function $I_A(X)$, which is one if $X \in A$ and zero
otherwise. For a credal set K(X), the function $\underline{P}(A) = \min_{p(X) \in K(X)} P(A) =
\min_{p(X) \in K(X)} E_{p(X)}[I_A(X)]$ is called a lower envelope and the function $\overline{P}(A) =
\max_{p(X) \in K(X)} P(A) = \max_{p(X) \in K(X)} E_{p(X)}[I_A(X)]$ is called an upper envelope.
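
For a finite outcome space, these quantities can be sketched by representing a credal set through its extreme points: an expectation is linear in the probability mass function, so its minimum and maximum over the convex hull are reached at the listed vertices. The function names and the numbers are hypothetical.

```python
def expectation(pmf, values):
    """Standard expectation of a function given as a list of values."""
    return sum(p * v for p, v in zip(pmf, values))


def lower_expectation(vertices, values):
    """Minimum expectation over the credal set spanned by the vertices."""
    return min(expectation(pmf, values) for pmf in vertices)


def upper_expectation(vertices, values):
    """Maximum expectation over the credal set spanned by the vertices."""
    return max(expectation(pmf, values) for pmf in vertices)


def indicator(event, n_outcomes):
    """Indicator function of an event over outcomes 0..n_outcomes-1."""
    return [1.0 if x in event else 0.0 for x in range(n_outcomes)]


# Hypothetical credal set over three outcomes, given by two extreme points.
vertices = [[0.2, 0.3, 0.5], [0.4, 0.4, 0.2]]
f = [1.0, 2.0, 4.0]
lower = lower_expectation(vertices, f)
upper = upper_expectation(vertices, f)
lower_via_upper = -upper_expectation(vertices, [-v for v in f])
p_low = lower_expectation(vertices, indicator({0, 1}, 3))
p_up = upper_expectation(vertices, indicator({0, 1}, 3))
```

The value `lower_via_upper` reproduces the duality between lower and upper expectations, and `p_low`, `p_up` are the lower and upper envelopes of the event `{0, 1}`.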

Closed convex sets of conditional probability measures are employed to represent
conditional beliefs. The conditional credal set $K(X|A)$ is defined by
conditional densities $p(X|A)$.

Quasi-Bayesian inference is performed by applying Bayes rule to each measure
in a joint credal set. A posterior credal set is the union of all posterior
measures obtained in this process. Consequently, for a bounded function f(X)
and an event A with positive lower envelope:

$$\underline{E}[f(X)|A] = \min_{p(X) \in K(X)} \frac{E_{p(X)}[f(X) I_A(X)]}{E_{p(X)}[I_A(X)]} \qquad (1)$$



If the event A has lower envelope equal to zero, the posterior credal set $K(X|A)$
is taken by convention to contain all probability densities $p(X|A)$. For an event
A with positive lower envelope, $\underline{E}[f(X)|A]$ is the unique solution of the following
equation in $\lambda$, called generalized Bayes rule by Walley [12, Section 6.4]:

$$\underline{E}[(f(X) - \lambda) I_A(X)] = 0,$$

where $I_A(X)$ is the indicator function of the event A (to simplify notation, I(X)
denotes the indicator function for a particular event defined by a value of X).
In Walley’s theory, the generalized Bayes rule is derived from basic principles 
of coherent behavior, but the rule can be derived directly from Expression (1). 
The generalized Bayes rule has a number of straightforward properties; the next 
lemma proves one such property that is important for the main results in this 
paper. 
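
The agreement between Expression (1) and the generalized Bayes rule can be checked numerically on a finite credal set given by its extreme points. The sketch assumes that the event has positive lower probability and solves the rule by bisection, which is one possible root-finding choice; the function names and the numbers are hypothetical.

```python
def lower_expectation(vertices, values):
    """Minimum expectation over the credal set spanned by the vertices."""
    return min(sum(p * v for p, v in zip(pmf, values)) for pmf in vertices)


def conditional_lower_direct(vertices, f, event):
    """Expression (1): minimum over the vertices of the ratio of expectations."""
    ratios = []
    for pmf in vertices:
        p_event = sum(pmf[x] for x in event)
        ratios.append(sum(pmf[x] * f[x] for x in event) / p_event)
    return min(ratios)


def conditional_lower_gbr(vertices, f, event, tol=1e-12):
    """Generalized Bayes rule: root of lower_expectation((f - lam) * I_A) in lam."""
    lo, hi = min(f), max(f)
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        gamble = [(f[x] - lam) if x in event else 0.0 for x in range(len(f))]
        if lower_expectation(vertices, gamble) > 0.0:
            lo = lam  # lam is still below the root
        else:
            hi = lam
    return 0.5 * (lo + hi)


# Hypothetical credal set over four outcomes and the event {0, 1}.
vertices = [[0.1, 0.3, 0.4, 0.2], [0.3, 0.1, 0.2, 0.4]]
f = [1.0, 3.0, 5.0, 7.0]
direct = conditional_lower_direct(vertices, f, {0, 1})
by_rule = conditional_lower_gbr(vertices, f, {0, 1})
```

Both routes give the same conditional lower expectation, because the lower expectation of `(f - lam) * I_A` is strictly decreasing in `lam` when the event has positive lower probability.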



Lemma 1. If $\underline{P}(W)$ and $\underline{P}(X)$ are larger than zero, the value of $\lambda$ that satisfies
the equation $\underline{E}[(g(Y) - \lambda) I(W) | X] = 0$ is equal to $\underline{E}[g(Y)|W, X]$ for an arbitrary
bounded function g(Y).

There are two possible proofs for this result, one directly derived from the 
concept of convex sets of measures and standard Bayes rule, and the other de- 
rived from the generalized Bayes rule. 



Proof. Using Expression (1):

$$\underline{E}[g(Y)|W,X] = \min_{p(Y,W|X) \in K(Y,W|X)} \frac{E_{p(Y,W|X)}[g(Y) I(W) | X]}{E_{p(Y,W|X)}[I(W) | X]}.$$

Consequently:

$$\min_{p(Y,W|X) \in K(Y,W|X)} \frac{E_{p(Y,W|X)}[g(Y) I(W) - \underline{E}[g(Y)|W,X] I(W) | X]}{E_{p(Y,W|X)}[I(W) | X]} = 0.$$

This equation has a solution when the numerator is zero, so an equivalent
equation is obtained:

$$\min_{p(Y,W|X) \in K(Y,W|X)} E_{p(Y,W|X)}[(g(Y) - \lambda) I(W) | X] = \underline{E}[(g(Y) - \lambda) I(W) | X] = 0,$$

where $\lambda$ is the unique solution and is consequently equal to $\underline{E}[g(Y)|W, X]$.

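Proof. Define $\gamma(\lambda) = \underline{E}[(g(Y) - \lambda) I(W) | X]$. By the generalized Bayes rule, the
value of $\gamma(\lambda)$ is the solution of the equation

$$\underline{E}[((g(Y) - \lambda) I(W) - \gamma(\lambda)) I(X)] = 0.$$

Because $\underline{E}[(g(Y) - \lambda) I(W) | X]$ is strictly decreasing with $\lambda$ and has positive and
negative values, there must be $\lambda_X$ such that $\gamma(\lambda_X) = 0$. For this value of $\lambda$,

$$\underline{E}[((g(Y) - \lambda_X) I(W) - 0) I(X)] = 0.$$

Consequently, for $\lambda = \lambda_X$:

$$\underline{E}[(g(Y) - \lambda_X) I(W) I(X)] = 0.$$

So, $\lambda_X$ is the unique solution of the equation $\underline{E}[(g(Y) - \lambda) I(W) I(X)] = 0$, and

$$\lambda_X = \underline{E}[g(Y)|W, X].$$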

The concept of independence was not defined in the original formulation
of Quasi-Bayesian theory by Giron and Rios [7]; indeed, there is considerable
controversy regarding judgements of independence in Quasi-Bayesian models
[1,3,6,9,13]. The results presented in this paper adopt Walley’s definition of
independence because this formulation can be reduced to preference relations,
following the Quasi-Bayesian philosophy [12].

Walley’s concepts can be defined as follows. Consider X, Y and Z, and the
joint credal set K(X, Y, Z). Two conditional credal sets $K(X|Y, Z)$ and $K(X|Z)$
are said to be equal if any density in $K(X|Y', Z')$ (for any fixed $Y = Y'$, $Z = Z'$) also
belongs to $K(X|Z')$, and vice-versa.

Definition 1. Variable Y is irrelevant to X given Z if $K(X|Z)$ is equal to
$K(X|Y, Z)$, regardless of the value of Z.

Definition 2. Variables X and Y are independent given Z if X is irrelevant 
to Y given Z and Y is irrelevant to X given Z . 

Note that Z can be empty; in this case the irrelevance and independence 
concepts are not “conditional” on any variable. 

Walley’s original definition of irrelevance is stated in terms of lower expectations.
The definitions are equivalent and stem from the equivalence between
lower expectations and convex sets of measures [12, Chapter 3]; the lower expectation
approach is convenient in several proofs.

Definition 3. Variable Y is irrelevant to X given Z if $\underline{E}[f(X)|Y, Z]$ is equal to
$\underline{E}[f(X)|Z]$ for any bounded function f(X), regardless of the value of Z.

4 Irrelevance and the Semi-graphoid Axioms

The symbol $\perp\!\!\!\perp$ is described by Dawid as meaning that “any information about
Y is irrelevant to uncertainty about X” [4]. This description is actually close to
Walley’s concept of irrelevance, as it does not depend on any symmetry axiom
like A1. In fact, the next theorem shows that Walley’s concept of irrelevance
satisfies all semi-graphoid axioms except A1.

Theorem 1. Walley’s concept of irrelevance is an asymmetric semi-graphoid. 

Proof. Use f(X) to denote an arbitrary bounded function. The axioms A2- 
A5 (understood in terms of Walley’s irrelevance) can be rephrased using lower 
expectations: 

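A2’ $\underline{E}[f(X)|Y, X] = \underline{E}[f(X)|X]$.

A3’ If $\underline{E}[f(X)|Y, Z] = \underline{E}[f(X)|Z]$ and $W = g(Y)$ (a bounded function of Y),
then $\underline{E}[f(X)|W, Z] = \underline{E}[f(X)|Z]$.

A4’ If $\underline{E}[f(X)|Y, Z] = \underline{E}[f(X)|Z]$ and $W = g(Y)$ (a bounded function of Y),
then $\underline{E}[f(X)|Y, W, Z] = \underline{E}[f(X)|W, Z]$.

A5’ If $\underline{E}[f(X)|Y, Z] = \underline{E}[f(X)|Z]$ and $\underline{E}[f(X)|W, Y, Z] = \underline{E}[f(X)|Y, Z]$, then
$\underline{E}[f(X)|W, Y, Z] = \underline{E}[f(X)|Z]$.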

These statements can now be proved: 

A2’ Note that 

and also 



A3’ The lower expectation $\underline{E}[f(X)|Z]$ is equal to

$$\min_{p(X,Y|Z) \in K(X,Y|Z)} E_{p(Y|Z)}[E_{p(X|Y,Z)}[f(X)|Y, Z] | Z].$$

Consider a probability density $p^*(X, Y, Z)$ such that $E_{p^*(X,Y,Z)}[f(X)|Y, Z] =
\underline{E}[f(X)|Z]$ regardless of Y. Such a $p^*(X|Y, Z)$ must exist, because the
minimum value of $E_{p(X|Y,Z)}[f(X)|Y, Z]$ is $\underline{E}[f(X)|Z]$ for every value of Y
(irrelevance of Y to X given Z) and the minimum of $E_{p(X|Z)}[f(X)|Z]$ is also
$\underline{E}[f(X)|Z]$. If no $p^*(X, Y, Z)$ existed, then the minimum of the expectation
$E_{p(Y|Z)}[E_{p(X|Y,Z)}[f(X)|Y, Z] | Z]$ would be larger than the minimum of
$E_{p(X|Y,Z)}[f(X)|Y, Z]$ (equal to $\underline{E}[f(X)|Y, Z]$). To simplify notation,
expectations calculated with respect to $p^*(X, Y, Z)$ have a * superscript.

Now compute $E^*[f(X)|W, Z]$:

$$E^*[f(X)|W, Z] = \frac{E^*[I_{\{W=g(Y)\}}(Y) E^*[f(X)|Y, Z] | Z]}{E^*[I_{\{W=g(Y)\}}(Y) | Z]} = \frac{\underline{E}[f(X)|Z] E^*[I_{\{W=g(Y)\}}(Y) | Z]}{E^*[I_{\{W=g(Y)\}}(Y) | Z]} = \underline{E}[f(X)|Z].$$

Note that the expectation produced by $p^*(X, Y, Z)$ is actually a lower bound
because

$$\min E_{p(Y|W,Z)}[I_{\{W=g(Y)\}}(Y) E_{p(X|Y,W,Z)}[f(X)|Y, W, Z] | W, Z]$$
$$\geq E_{p(Y|W,Z)}[I_{\{W=g(Y)\}}(Y) \underline{E}[f(X)|Y, Z] | W, Z]$$
$$= E_{p(Y|W,Z)}[I_{\{W=g(Y)\}}(Y) \underline{E}[f(X)|Z] | W, Z] = \underline{E}[f(X)|Z].$$

Consequently, $\underline{E}[f(X)|W, Z] = \underline{E}[f(X)|Z]$.

A4’ By the generalized Bayes rule, $\underline{E}[f(X)|Y, W, Z]$ is the value $\mu^*$ that satisfies

$$\underline{E}[(f(X) - \mu) I(YWZ)] = 0.$$

Because $I(YWZ) = I(YZ)$, the value $\mu^*$ is also the solution of

$$\underline{E}[(f(X) - \mu) I(YZ)] = 0.$$

Consequently, $\mu^* = \underline{E}[f(X)|Y, Z]$. Because $\underline{E}[f(X)|Y, Z] = \underline{E}[f(X)|Z]$ by
hypothesis, $\underline{E}[f(X)|Y, W, Z] = \underline{E}[f(X)|Z] = \underline{E}[f(X)|W, Z]$ (by Axiom A3’).

A5’ By transitivity: $\underline{E}[f(X)|W, Y, Z] = \underline{E}[f(X)|Y, Z] = \underline{E}[f(X)|Z]$.

5 Independence and the Semi-graphoid Axioms 

Walley’s concept of independence can conceivably satisfy all semi-graphoid axioms, as it certainly displays the symmetry property required by A1. However, axiom A5 is not satisfied by Walley’s concept of independence. Denote Walley’s relations of irrelevance and independence respectively by $\perp\!\!\!\perp_{IR}$ and $\perp\!\!\!\perp_{IN}$. The following property is not true:

A5” If $X \perp\!\!\!\perp_{IN} Y \mid Z$ and $X \perp\!\!\!\perp_{IN} W \mid (Y,Z)$, then $(Y,W) \perp\!\!\!\perp_{IN} X \mid Z$.

An example demonstrates the failure of A5”. Consider three binary variables X, Y and W (in this example the variable Z is omitted). Suppose that K(X, Y, W) is the convex hull of three joint densities $p_1(X,Y,W)$, $p_2(X,Y,W)$ and $p_3(X,Y,W)$ displayed in Table 1. This credal set satisfies $X \perp\!\!\!\perp_{IN} Y$ and $X \perp\!\!\!\perp_{IN} W \mid Y$ because:

$\underline{p}(X_0|Y_0) = \underline{p}(X_0|Y_1) = \underline{p}(X_0)$   ($p(X_0) \in [0.2, 0.3]$)

$\underline{p}(Y_0|X_0) = \underline{p}(Y_0|X_1) = \underline{p}(Y_0)$   ($p(Y_0) \in [0.3, 0.4]$)

$\underline{p}(X_0|W_0,Y_0) = \underline{p}(X_0|W_1,Y_0) = \underline{p}(X_0|W_0,Y_1) = \underline{p}(X_0|W_1,Y_1) = \underline{p}(X_0)$

$\underline{p}(W_0|X_0,Y_0) = \underline{p}(W_0|X_1,Y_0) = \underline{p}(W_0|Y_0)$   ($p(W_0|Y_0) \in [0.1, 0.2]$)

$\underline{p}(W_0|X_0,Y_1) = \underline{p}(W_0|X_1,Y_1) = \underline{p}(W_0|Y_1)$   ($p(W_0|Y_1) \in [0.4, 0.8]$).

But for $h(Y,W) = I_{\{Y=Y_0, W=W_0\}}(Y,W)$, $\underline{E}[h(Y,W)|X_0] = \underline{p}(Y_0,W_0|X_0) = 0.0372$ and $\underline{E}[h(Y,W)] = \underline{p}(Y_0,W_0) = 0.04$. Consequently, it is not true that $(Y,W) \perp\!\!\!\perp_{IN} X$.

Despite the failure of A5”, all other semi-graphoid axioms are satisfied by
Walley’s independence.
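The numbers in this counterexample can be checked directly from the three vertex densities of Table 1. The sketch below is only an illustration: the function names `prob` and `lower_cond` are hypothetical, and it relies on the assumption that a conditional probability, being a ratio of linear functions of the density, attains its minimum over the convex hull at one of the three vertices.

```python
# Rows of Table 1: (W, X, Y) -> (p1, p2, p3); index 0 stands for W0, X0, Y0.
TABLE = {
    (0, 0, 0): (0.008, 0.018, 0.0093),
    (1, 0, 0): (0.072, 0.072, 0.0757),
    (0, 1, 0): (0.032, 0.042, 0.037),
    (1, 1, 0): (0.288, 0.168, 0.228),
    (0, 0, 1): (0.096, 0.084, 0.09),
    (1, 0, 1): (0.024, 0.126, 0.075),
    (0, 1, 1): (0.384, 0.196, 0.290),
    (1, 1, 1): (0.096, 0.294, 0.195),
}


def prob(k, event):
    """Probability of `event` (a predicate on w, x, y) under vertex density k."""
    return sum(p[k] for wxy, p in TABLE.items() if event(*wxy))


def lower_cond(event, given=lambda w, x, y: True):
    """Lower conditional probability, minimised over the three vertices."""
    return min(
        prob(k, lambda w, x, y: event(w, x, y) and given(w, x, y)) / prob(k, given)
        for k in range(3)
    )


def h(w, x, y):
    """Indicator of the event {Y = Y0, W = W0}."""
    return y == 0 and w == 0


lower_given_x0 = lower_cond(h, lambda w, x, y: x == 0)
lower_marginal = lower_cond(h)
print(round(lower_given_x0, 4), round(lower_marginal, 4))
```

The two printed values differ, which is the failure of A5” described above; the same helper can be used to confirm the individual irrelevance statements, for instance that the lower probability of X0 is unchanged by conditioning on Y0.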
W    X    Y    p1(X,Y,W)   p2(X,Y,W)   p3(X,Y,W)
W0   X0   Y0   0.008       0.018       0.0093
W1   X0   Y0   0.072       0.072       0.0757
W0   X1   Y0   0.032       0.042       0.037
W1   X1   Y0   0.288       0.168       0.228
W0   X0   Y1   0.096       0.084       0.09
W1   X0   Y1   0.024       0.126       0.075
W0   X1   Y1   0.384       0.196       0.290
W1   X1   Y1   0.096       0.294       0.195

Table 1. Joint densities that define a credal set.



Theorem 2. Walley’s concept of irrelevance is an incomplete semi-graphoid. 

Proof. Axiom A1 is immediate as Walley’s independence is just a symmetrization of Walley’s concept of irrelevance. Take f(Y) and g(W) to denote arbitrary bounded functions. To demonstrate axioms A2-A4 (understood in terms of Walley’s independence), only the following facts must be proved in addition to statements already proved in Theorem 1:

A2” $\underline{E}[f(Y)|X,X] = \underline{E}[f(Y)|X]$.

A3” If $\underline{E}[f(Y)|X,Z] = \underline{E}[f(Y)|Z]$ and $W = g(Y)$, then $\underline{E}[f(W)|X,Z] = \underline{E}[f(W)|Z]$.

A4” If $\underline{E}[f(Y)|X,Z] = \underline{E}[f(Y)|Z]$ and $W = g(Y)$, then $\underline{E}[f(Y)|X,W,Z] = \underline{E}[f(Y)|W,Z]$.

Axiom A2” is immediate. The other ones can be demonstrated as follows:

A3” Because $\underline{E}[f(W)|X,Z]$ is equal to $\underline{E}[f(g(Y))|X,Z]$ and $\underline{E}[f(g(Y))|X,Z] = \underline{E}[f(g(Y))|Z]$ by hypothesis, $\underline{E}[f(W)|X,Z] = \underline{E}[f(W)|Z]$.

A4” By Lemma 1, $\underline{E}[f(Y)|X,W,Z]$ is the value $\lambda^*$ that solves the equation

$$\underline{E}[(f(Y) - \lambda)\, I(W) \,|\, X,Z] = 0.$$

By hypothesis, $\underline{E}[(f(Y) - \lambda)\, I(g(Y)) \,|\, X,Z] = \underline{E}[(f(Y) - \lambda)\, I(g(Y)) \,|\, Z]$, so $\lambda^*$ is also the solution of the equation

$$\underline{E}[(f(Y) - \lambda)\, I(W) \,|\, Z] = 0.$$

Consequently, $\lambda^* = \underline{E}[f(Y)|X,W,Z] = \underline{E}[f(Y)|W,Z]$.

6 Conclusion 

The properties of Walley’s concept of irrelevance and independence depart in 
various ways from the properties of standard probabilistic independence. The 
most obvious difference is that irrelevance and independence are not identical 
in Walley’s framework. This distinction is not commonly employed in statistical 
parlance; for example, both Dawid [4] and Pearl [11] justify the semi-graphoid
axioms using the words irrelevance and independence almost interchangeably. 
Dawid even introduces the properties of the semi-graphoids indicating that “we 
can rephrase these as assertions of irrelevance” [4]. 

But concepts of irrelevance and independence do not have identical meaning 
in our quotidian discourse. The concept of irrelevance is associated with an 
asymmetric notion; if an object has no bearing upon some situation, we are 
content to say that the object is irrelevant to the situation, without further 
analysis of the reciprocal relation. Quite differently, the concept of independence 
seems to be intrinsically tied to a notion of symmetry, as two objects or situations 
are independent only when they are irrelevant to each other. 

Such considerations of symmetry are crucial to appreciate the results in this 
paper. The failure of A1 for Walley’s irrelevance and the failure of A5” for 
Walley’s independence are quite satisfactory under this perspective. 

Axiom A1 is simply a symmetry requirement, so it must fail for irrelevance. 
And it must work for independence, as it does in Walley’s framework. 

The failure of Axiom A5” may seem strange at first, because A5 is taken as
a very reasonable requirement of independence. But all that is reasonable about 
A5 is contained in the irrelevance property A5’. In fact, Dawid [4] and Pearl [11] 
defend only A5’ when they justify Axiom A5. To quote Dawid: 

P5 [Axiom A5 in the current paper] says that if, knowing Z, Y is ir-
relevant to X; and also, knowing both Y and Z, W is irrelevant to X;
then, knowing Z, (Y, W) must be jointly irrelevant to X. However we
understand “irrelevance”, this seems to be a desirable property.

Property A5’ is certainly desirable when applied to irrelevance, but it is hard to
see why its converse A5” must be valid for independence. Suppose that, knowing
Z, X is irrelevant to Y; and also, knowing both Y and Z, X is irrelevant to W;
then, is it desirable that, knowing Z, X must be irrelevant to (Y, W)? Why
would it be impossible for X to be irrelevant to each of Y and W taken separately,
but still for X to be relevant to Y and W taken together? Property A5” only seems
reasonable when preconceived notions of symmetry are applied to A5’.

These considerations indicate that Walley’s framework offers an appropriate
path for the study of irrelevance and independence. The present paper
contributes an initial step along this path, and suggests that asymmetric and
incomplete semi-graphoids are quite important structures that deserve special
attention.
Abstract. The problem of assessing the value of a candidate is viewed
here as a multiple combination problem. On the one hand a candidate 
can be evaluated according to different criteria, and on the other hand 
several experts are supposed to assess the value of candidates according 
to each criterion. Criteria are not equally important, experts are not 
equally competent or reliable. Moreover levels of satisfaction of criteria, 
or levels of confidence are only assumed to take their values in qualitative 
scales which are just linearly ordered. The problem is discussed within 
the framework of possibility theory which offers a qualitative setting for 
handling it. 



1 Introduction 

The problem of assessing the value of a candidate (it may be a person, an object, or any abstract entity, for instance) is often encountered in practice. It is usually a preliminary step before making a choice. Such a problem can be handled in different manners depending on what the evaluation is based on. One may have expert generic rules which aim at classifying candidates in different categories (e.g., ’excellent’, ’good’, ..., ’very bad’). One may have a base of cases made of previous evaluations from which a similarity-based evaluation of the new candidate can be performed. This corresponds to two popular approaches in Artificial Intelligence (namely, expert systems and case-based reasoning), that one might also like to combine. In the following, the value assessment problem is rather posed in terms of multiple criteria, whose value for a given candidate can be more or less precisely assessed with some level of confidence by various experts (whose opinions are also to be fused). This problem is sometimes referred to as a subjective evaluation process [12]. For a discussion of the relation between the rule-based, case-based and criteria-based approaches, the reader is referred to [4, 7]. An important issue in such a problem is that the assessment of the value of a criterion for a candidate by an expert is often linguistically expressed, is qualitative in nature and pervaded with uncertainty. In other words, only ordinal and qualitative information is available. This applies to the possible levels of satisfaction of criteria, as well as to their level of importance, or to the level of confidence in expert assessments. For this reason, qualitative possibility theory [8] is chosen as a representation framework for handling imprecise values pervaded with qualitative uncertainty. This might be related to the processing of fuzzy marks for students’ evaluation for which an (ad hoc) treatment was proposed recently [2]. An alternative would have been to map the ordinal scale on a suitable cardinal scale by a method such as “Macbeth” [11]. However this approach in fact requires additional information.

Note that the problem considered in the following is not simply to rank-order candidates on the basis of a set of criteria. In that case pairwise comparisons could lead to a simple outranking solution. The problem here is to design a procedure where multiple experts’ assessments are faithfully represented and fused. As we shall see, different families of solutions are possible according to the way the criteria interact.

The paper is organized in two main parts. The problem is first precisely stated and the questions that it raises are pointed out. Then the proposed approach is presented and illustrated on an example.



2 The multiple expert multiple criteria assessment 
problem 

The value of a candidate has to be assessed. For this evaluation, m criteria are used. For each criterion, the value of the candidate is assessed (possibly imprecisely) by n experts (or sources). Criteria can be rank-ordered according to their importance. Each expert provides each of his precise or imprecise evaluations with his level of confidence in it. The general reliability of each expert is qualitatively given by the decision maker (DM) who is interested in the evaluation of the candidate.

Some general comments have to be made about the problem thus informally stated. There may be only one candidate to be evaluated (or several). This restricts considerably the scope of possible methods. More specifically, with a set of candidates, one may want to i) choose the best one(s); ii) rank all candidates from best to worst; iii) give a partial order between them, with possibly some incomparabilities; iv) cluster the candidates in several groups (e.g., good ones, bad ones, those with a weak point w.r.t. one criterion, etc.). In fact, many methods in decision analysis proceed by a pairwise comparison of candidates, which presupposes that all of them are known from the beginning. Indeed, one looks here for a global evaluation of a candidate which may be unique. If the problem were only to rank-order candidates, a lexicographic approach (based on the comparison of vectors made of scores of each candidate w.r.t. all criteria ordered according to their importance) would be enough. However such an approach assumes that a complete order exists between scores (which is not in agreement with the fact that the scores may be imprecise and pervaded with uncertainty).
We now introduce some notation. The candidate is denoted $K$, or $K_1, K_2, \dots$ if there are several. $K$ is omitted when unnecessary. Criteria are numbered by $i$: $i = 1, \dots, m$. The true (unknown) value of the level of satisfaction of criterion $i$ for the candidate $K$ is denoted $c_i(K)$, with $c_i(K) \in L_s$, where $L_s$ is the ordinal scale of levels of satisfaction, e.g. $L_s = \{1, 2, 3, 4, 5\}$ with 1 = very bad, ..., 5 = very good. An element of $L_s$ is denoted $s$.

The evaluation of each criterion for $K$ and a given expert may be imprecise or uncertain (either due to the fact that it is unclear to what precise extent $K$ satisfies criterion $i$ or due to the possible lack of competence in $i$ of the expert assessing the value). Each evaluation will be represented by a possibility distribution (discounted in case of limited expertise) restricting the more or less possible values of this evaluation. Let $\pi^j_{c_i(K)}$ ($\pi_i^j$ for short) be the possibility distribution restricting the possible values of $c_i(K)$ according to expert $j$.

Experts are numbered by $j$: $j = 1, \dots, n$. $\pi_i^j$ is a mapping from $L_s$ to $L_\pi$. In the example we shall use $L_\pi = \{0, a, b, 1\}$ which is an ordinal scale, where 0 corresponds to impossibility, and 1 to total possibility. The true (unknown) global score of $K$ is denoted $c(K) \in L_s$. The associated possibility distribution is $\pi_{c(K)}$, or simply $\pi$. It represents the result of the assessment procedure. Again $\pi : L_s \to L_\pi$. The confidence level of expert $j$ in his assessment when judging criterion $i$ is denoted $\gamma_{ij}$, and these levels are defined on an ordinal scale $L_\gamma$ which can be related to $L_\pi$ as we shall see. The confidence of the decision maker in expert $j$’s opinions is denoted $\alpha_j$, $j = 1, \dots, n$, and the $\alpha_j$’s are defined on an ordinal scale $L_\alpha$. For instance, $L_\alpha = \{0, r, s, 1\}$ with 0 = not confident at all, r = not very confident, ..., 1 = very confident. The levels of importance of criteria are denoted $\beta_i$, $i = 1, \dots, m$, and the $\beta_i$’s also belong to an ordinal scale $L_\beta$. For instance $L_\beta = \{0, e, f, 1\}$ with 0 = not important at all, e = not very important, ..., 1 = very important.

In the following, $\vee$ and $\wedge$ denote max and min on a given ordinal scale. $\neg x$ for any $x$ in a given ordinal scale $L = \{0, s_1, \dots, s_k, 1\}$ denotes the value corresponding to the reversed scale (i.e. $\neg 0 = 1$, $\neg s_i = s_{k-i+1}$). It is the counterpart to $1 - x$ on the [0, 1] difference scale.
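As a minimal sketch of these conventions, an ordinal scale can be held as a list ordered from its lowest to its highest level, so that max, min and the order-reversing negation only use positions in the list. The helper names `reverse`, `vee` and `wedge` are hypothetical and introduced only for illustration.

```python
# Ordinal scale from the text, listed from lowest to highest level.
L_PI = ["0", "a", "b", "1"]


def reverse(level, scale):
    """Order-reversing negation: the ordinal counterpart of 1 - x."""
    return scale[len(scale) - 1 - scale.index(level)]


def vee(x, y, scale):
    """Maximum of two levels of the same scale."""
    return max(x, y, key=scale.index)


def wedge(x, y, scale):
    """Minimum of two levels of the same scale."""
    return min(x, y, key=scale.index)


print(reverse("a", L_PI), vee("a", "b", L_PI), wedge("a", "b", L_PI))
```

Only the ordering of the levels is used, never a numerical difference between them, which is what makes the operations meaningful on a purely ordinal scale.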



3 Data of the example 



For illustrative purposes we use the following example with m = 6 criteria and n = 4 experts. Imagine that in some company, a new collaborator K has to be hired for the marketing department. Six criteria are used for assessing his qualifications: analysis capacities (Ana), learning capacities (Lear), past experience (Exp), communication skills (Com), decision-making capacities (Dec) and creativity (Crea). The four experts are the directors of Marketing (Mkt), Financial (Fin), Production (Prod), and Human Resources (HR) departments. The data are summarized in Tables 1, 2(a), 2(b) and 3.
Fig. 1. Assessment by each director on each criterion of candidate K using $L_s = \{1, 2, 3, 4, 5\}$.

(a) DM’s confidence in the directors’ opinions

(b) Importance of criteria

Fig. 2.

4 The proposed approach 

The problem raises three main questions: i) the representation of the precise or imprecise score of the candidate provided by each expert for each criterion, including the expert’s confidence in his assessment; ii) the fusion of expert opinions; iii) the multiple criteria aggregation. The first step is easily handled in the possibilistic framework.

The possibility distribution of the true value of each score on criterion i according to expert j, taking into account his competence, is computed in the following way. Interval-valued scores (including single values) are modelled by a possibility distribution taking the value 1 in the interval and 0 outside. Blanks (absence of answers) are interpreted as a possibility distribution being 1 everywhere (modelling ’unknown’). The confidence level $\gamma_{ij}$ is taken into account by a discounting process, defined as ($\tilde\pi_i^j$ denotes the original possibility distribution function (p.d.f.)):

$$\pi_i^j(s) = \tilde\pi_i^j(s) \vee \neg\gamma_{ij}, \quad \forall s \in L_s \qquad (1)$$

Fig. 3. Competence of directors for each criterion

Note that the certainty level $\gamma_{ij} \in L_\gamma$ is turned into a possibility level over the scores not compatible with $\tilde\pi_i^j$. Thus $L_\pi$ is $L_\gamma$ reversed (usual equivalence between certainty of A and possibility of not A). This leads to Table 4 made of p.d.f.’s. In each box the p.d.f. is enumerated on $L_s$.
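The discounting step can be sketched as follows. This is only an illustration under stated assumptions: the confidence level is assumed to be expressed with the same labels as $L_\pi$ = {0, a, b, 1} (a hypothetical commensurateness choice, since the competence table is not reproduced here), and the interval-valued score [3, 4] is invented for the example.

```python
L_PI = ["0", "a", "b", "1"]                # possibility scale, lowest to highest
RANK = {level: r for r, level in enumerate(L_PI)}
TOP = len(L_PI) - 1


def discount(pi, gamma):
    """Raise every possibility degree to at least the reversed confidence level.

    pi    -- dict mapping each score of L_s to a level of L_PI
    gamma -- confidence level, assumed to use the labels of L_PI
    """
    floor = TOP - RANK[gamma]              # rank of the negation of gamma
    return {s: L_PI[max(RANK[p], floor)] for s, p in pi.items()}


# Hypothetical interval-valued score [3, 4] on L_s = {1, ..., 5}.
interval = {s: ("1" if 3 <= s <= 4 else "0") for s in range(1, 6)}
discounted = discount(interval, "b")
print(discounted)
```

With full confidence the distribution is left unchanged, whereas with the lowest confidence level every score becomes fully possible, which coincides with the treatment of a blank answer.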
Fig. 4. Opinions of directors on each criterion

As already said, the global assessment requires two types of combination: a multicriteria aggregation problem, and the fusion of expert evaluations. So, depending on the way the problem is presented, we may either think of i) first computing the global evaluation of K according to each expert and then fusing these evaluations into a unique one, or ii) on the contrary, first fusing the expert evaluations for each criterion, and then aggregating the “global” results pertaining to each criterion. In general, the two procedures are not equivalent (i.e., expert opinion fusion and multiple criteria aggregation do not commute). So it is important to understand what the meaningful order between the fusion and aggregation operations is or, if this remains unclear, to choose fusion and aggregation modes which commute.
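A deliberately simplified illustration of the lack of commutation is given below. It assumes precise scores, an unweighted maximum for the fusion of experts and an unweighted minimum for the aggregation of criteria; the two experts and their scores are hypothetical.

```python
# Hypothetical precise scores of one candidate on two criteria, on L_s = {1..5}.
scores = {
    "expert_A": [1, 5],
    "expert_B": [5, 1],
}

# i) aggregate the criteria for each expert (min), then fuse the experts (max).
aggregate_then_fuse = max(min(row) for row in scores.values())

# ii) fuse the experts for each criterion (max), then aggregate the criteria (min).
fuse_then_aggregate = min(max(column) for column in zip(*scores.values()))

print(aggregate_then_fuse, fuse_then_aggregate)
```

In the first order each expert sees a weak point and the candidate is rated low by everyone, while in the second order the disagreement is resolved criterion by criterion before the weak points are looked for, so the two results differ.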

At this point, it is worth emphasizing that expert opinion fusion and multicriteria aggregation are two operations which do not convey the same intended semantics. The fusion of expert opinions aims at finding out what the possible values of the genuine score of K for a given criterion are, and possibly at detecting conflicts between experts. Hopefully, some consensus should be reached at least on values which are excluded as possible values of the score. The aggregation of multiple criteria evaluations aims at assessing the global worth of the candidate from his scores on the different criteria; then different aggregation attitudes may be considered, e.g., conjunctive ones where each criterion is viewed as a constraint to satisfy to some extent, or compensatory ones where trade-offs are allowed.

In the following, we choose to merge experts’ opinions on each criterion first, and then to perform a multicriteria aggregation, since it might seem more natural to use the experts first to properly assess the score according to each criterion. Proceeding the other way round would assume that each expert is looking for a global evaluation of the candidate (maybe using his own criteria aggregation attitude) and that the decision maker is only there for combining and weighting experts’ evaluations.

Since we first merge the p.d.f.’s, one allowable method is to use a weighted maximum. Indeed disjunctive combination is advisable in case of conflicting opinions [5], as is the case here (conjunctive fusion can be applied only if there is no conflict). A disjunctive combination means that the opinion of an important expert will be taken into account, even if this can conflict with another important one. This is defined as:

$$\pi_i(s) = \bigvee_j \left[ w_{ij} \wedge \pi_i^j(s) \right], \quad \forall s \in L_s, \ \forall i = 1, \dots, 6 \qquad (2)$$

where $w_{ij}$ is a weight defined on $L_\pi$, representing the confidence in the opinion of expert j on criterion i. Obviously, $w_{ij}$ should be here the conjunction of $\gamma_{ij}$ (expert’s own confidence) with $\alpha_j$ (DM’s confidence in the expert). Thus, $w_{ij} = \gamma_{ij} \otimes \alpha_j$ where $\otimes$ is a conjunction operator from $L_\gamma \times L_\alpha$ to $L_\pi$. Table 5 defines $\otimes$ on the basis of an implicit commensurateness hypothesis of $L_\gamma$ and $L_\alpha$. Table 6 gives the weights for all criteria and experts. We apply formula (2) to the p.d.f.’s of Table 4, and obtain the p.d.f.’s of Table 7(a).

Fig. 5. Definition of $w_{ij}$

Fig. 6. Weights $w_{ij}$ for all criteria and directors

Fig. 7.
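The weighted maximum of formula (2) can be sketched on integer ranks, with 0, 1, 2, 3 standing for the levels 0, a, b, 1 of $L_\pi$. The two expert distributions and their weights below are hypothetical, since the tables of the example are not reproduced here, and the function name `fuse` is an assumption.

```python
def fuse(pdfs, weights):
    """Weighted maximum: for each score, max over experts of min(weight, degree).

    pdfs    -- list of dicts mapping each score of L_s to a rank of L_pi
    weights -- list of ranks of L_pi, one per expert
    """
    scores = pdfs[0].keys()
    return {
        s: max(min(w, pdf[s]) for pdf, w in zip(pdfs, weights))
        for s in scores
    }


# Two hypothetical, conflicting experts on L_s = {1, ..., 5}.
expert_1 = {1: 0, 2: 0, 3: 3, 4: 3, 5: 0}   # score in [3, 4], weight b
expert_2 = {1: 3, 2: 3, 3: 0, 4: 0, 5: 0}   # score in [1, 2], weight a
fused = fuse([expert_1, expert_2], [2, 1])
print(fused)
```

In this sketch no score reaches the top rank after fusion, because neither weight is maximal; the fused distribution is then not normalized.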

We notice that some distributions are no longer normalized (i.e., no score is fully possible at level 1). This is because the weights are not normalized, i.e. $\bigvee_j w_{ij} < 1$ for some i’s, which means that the DM cannot be fully confident in any director when assessing some criteria (namely Ana, Lear and Dec).
We should either use the unnormalized distributions to fully keep track of the problem, or normalize them in a suitable way. Here, we choose the second solution, and propose the following approach. We consider that the maximum of the distribution, denoted $h$, reflects the uncertainty level, considered to be $\neg h$ in the information. This means that the amount of conflict is changed into a level of uncertainty and the minimum level of the modified distributions will be $\neg h$. Then, we make an additional hypothesis that the scale is a difference scale (which is questionable!) so that the profile of the distribution is conserved. Specifically ($\tilde\pi_i$ denotes the old distribution):

$$\pi_i(s) = \tilde\pi_i(s) + \neg\Big( \bigvee_s \tilde\pi_i(s) \Big) \qquad (3)$$

It means that + in (3) is defined by $s_i + s_j = s_{\min(k+1,\, i+j)}$ on a scale $\{s_0 = 0, s_1, \dots, s_k, s_{k+1} = 1\}$. The result is shown in Table 7(b).
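Formula (3) amounts to shifting the whole distribution upwards by the rank of the negated maximum, with saturation at the top of the scale. A minimal sketch on integer ranks, where 0, ..., 3 stand for the levels of $L_\pi$ and the input distribution is hypothetical, could read:

```python
TOP = 3  # rank of level 1 on L_pi = {0, a, b, 1}


def normalize(pi):
    """Shift pi by the reversed rank of its maximum, using a bounded sum."""
    h = max(pi.values())
    shift = TOP - h                          # rank of the negation of h
    return {s: min(TOP, v + shift) for s, v in pi.items()}


unnormalized = {1: 1, 2: 1, 3: 2, 4: 2, 5: 0}   # hypothetical fused p.d.f.
normalized = normalize(unnormalized)
print(normalized)
```

After the shift at least one score is fully possible and no score lies below the negated maximum, in line with the interpretation of the conflict as a level of uncertainty. The shape of the distribution is kept only because ranks are added, which is exactly the difference-scale hypothesis questioned in the text.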

Concerning the merging of expert opinions, it is not clear if their “weight” is absolute or relative. If they are relative, we may think of using a “nonmonotonic disjunction”: a disjunction in order to keep all the information (and avoid conflicts), “nonmonotonic” for discounting the part of the information provided by less important experts which is in conflict with what is provided by the more important ones; see [5] for details. If the weights, say $\alpha_j$, are absolute as here, we may use a weighted disjunction of the form $\bigvee_j \alpha_j \wedge \pi^j_{c(K)}$ where $\pi^j_{c(K)}$ is the result of the multiple criteria aggregation performed for each expert j. However, as already said, we may first perform the expert opinion fusion; in such a case, we compute for each criterion i, $\bigvee_j \alpha_j \wedge \pi^j_{c_i(K)}$. Then it appears that the weight of expert j for criterion i should be upper bounded by his competence, i.e., $\alpha_j$ should be changed into $\alpha_j \wedge \gamma_{ij}$ in the previous expression.

The way of aggregating the criteria evaluations is not at all specified in the statement of the problem. Only qualitative levels of importance $\beta_i$ are provided for each criterion i. Even with ordinal scales, different attitudes can be thought of. The aggregation may be purely conjunctive (based on the “min” operation), or somewhat compensatory (using a median operation for instance). It might also be disjunctive: at least one important criterion is to be satisfied, which is modelled by a weighted maximum. More general aggregation attitudes can be captured by Sugeno’s integral [14, 13]. Let us assume that the DM has a conjunctive attitude.

Then there are still different possible ways of understanding the weighting of the criteria, even in a qualitative conjunctive setting. Let $c_i(K)$ be the supposedly precise value of the score of K according to criterion i. Then, we can i) modify $c_i(K)$ into $\max(c_i(K), \neg\beta_i)$ (discounting); ii) modify $c_i(K)$ into

$$\begin{cases} 1 & \text{if } c_i(K) \geq \beta_i \\ c_i(K) & \text{if } c_i(K) < \beta_i \end{cases}$$

($\beta_i$ is interpreted as a threshold level to reach); iii) or we may even think of $\neg\beta_i$ as a “bonus” to be added to $c_i(K)$, which requires a richer scale. Then $c_i(K)$ would be modified into $\neg\beta_i + c_i(K)$. Once the type of weighted conjunction has been chosen, it has to be extended to the p.d.f.’s $\pi_i$, since the precise value of $c_i(K)$ is not available.

The “fuzzy” evaluation provided by the p.d.f.’s can be handled in different ways. A natural manner to proceed is to extend multicriteria aggregation techniques to such non-scalar evaluations (this can be done whether this step is performed before or after the fusion step). At the technical level, this will be done using an extension principle [15] which makes it possible to extend any function/operation $f$ to any fuzzy arguments. Namely,

$$f(P_1, \dots, P_m)(s) = \bigvee_{s^1, \dots, s^m \,:\, s = f(s^1, \dots, s^m)} P_1(s^1) \wedge \dots \wedge P_m(s^m)$$

where $P_i(s^i)$ is the possibility degree of score $s^i$ according to p.d.f. $P_i$. Then the result will be a possibility distribution $\pi_{c(K)}$ restricting the possible values of the global evaluation of K. Then a fuzzy ranking method should be used for assessing to what extent it is certain, and to what extent it is possible, that $K_1$ is better than $K_2$, on the basis of $\pi_{c(K_1)}$ and $\pi_{c(K_2)}$, in case of several candidates [3].
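The three ways of weighting a precise score listed above can be compared on a small sketch. It assumes, hypothetically, that scores and importance levels are expressed as ranks 0, ..., 4 of one common scale, and it borrows a bounded sum for the “bonus” reading, which is an assumption of this illustration rather than something the text prescribes.

```python
TOP = 4  # highest rank of the common scale assumed for scores and importances


def neg(level):
    """Order-reversing negation on ranks."""
    return TOP - level


def discounting(c, beta):
    """i) the score cannot fall below the reversed importance."""
    return max(c, neg(beta))


def threshold(c, beta):
    """ii) full satisfaction as soon as the score reaches the importance level."""
    return TOP if c >= beta else c


def bonus(c, beta):
    """iii) the reversed importance is added to the score (bounded sum assumed)."""
    return min(TOP, neg(beta) + c)


for beta in (3, 1):
    print(beta, discounting(2, beta), threshold(2, beta), bonus(2, beta))
```

For an important criterion the three readings leave a mediocre score low, while for an unimportant one they all raise it, to different extents; this is what allows a conjunctive aggregation to ignore weak points on criteria that matter little.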

We use the weighted minimum method in what follows for the aggregation. With our notations, in the case of precise scores, the global score is obtained by

$$c(K) = \bigwedge_{i=1}^{6} \left( \neg\beta_i \vee c_i(K) \right) \qquad (4)$$

where $\vee$ is a disjunctive operator from $L_\beta \times L_s$ to $L_s$. Table 8 gives the definition of this operator. Since scores of criteria are imprecise, the extension principle has to be used. We have for any $s \in L_s$

$$\pi(s) = \bigvee_{c_1, \dots, c_6 \,:\, s = \bigwedge_i (\neg\beta_i \vee c_i)} \left( \pi_1(c_1) \wedge \dots \wedge \pi_6(c_6) \right) \qquad (5)$$

Fig. 8. Definition of $\vee$

Applying this formula to our data, we obtain as final distribution the one given in Table 9. The result can be interpreted by saying that the candidate K is certainly not a very good candidate, but K is most likely between bad and good. The imprecision of the result comes from the fact that 3 criteria among 6 cannot be precisely assessed by directors, and that some opinions are divergent. Also, the weighted minimum procedure assigns a weak score as soon as a single important criterion is not satisfied.

Fig. 9. Possibility distribution $\pi$ of the final score of K
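The extension of the weighted minimum to possibility distributions can be sketched by brute-force enumeration of all score vectors. In this illustration scores are ranks 0, ..., 4, possibility degrees are ranks 0, ..., 3, and the importance levels are assumed to be already mapped onto the score ranks (a hypothetical stand-in for the operator of Table 8, which is not reproduced here). The two-criteria data are invented.

```python
from itertools import product

TOP_S = 4  # highest score rank; scores are 0, ..., 4


def weighted_min(scores, betas):
    """Weighted minimum of precise scores: min over i of max(neg(beta_i), c_i)."""
    return min(max(TOP_S - b, c) for c, b in zip(scores, betas))


def extend(pdfs, betas):
    """Extension principle: possibility degree of each global score."""
    levels = range(TOP_S + 1)
    result = {s: 0 for s in levels}
    for combo in product(levels, repeat=len(pdfs)):
        s = weighted_min(combo, betas)
        degree = min(pdf[c] for pdf, c in zip(pdfs, combo))
        result[s] = max(result[s], degree)
    return result


# Hypothetical data: criterion 1 is fully important with score in {2, 3};
# criterion 2 is of medium importance and its score is unknown.
pi_1 = {0: 0, 1: 0, 2: 3, 3: 3, 4: 0}
pi_2 = {0: 3, 1: 3, 2: 3, 3: 3, 4: 3}
global_pi = extend([pi_1, pi_2], betas=[4, 2])
print(global_pi)
```

The enumeration visits every combination of scores, which stays cheap for six criteria on a five-level scale; the unknown second criterion does not spread the result below the floor set by its limited importance.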

Note that if weighted min combinations are used both in the expert opinion 
fusion and in the multi-criteria aggregation then the two combinations commute. 

The assessment problem can be viewed as a decision to be made under uncertainty. Namely, the relevance of the criteria defines a candidate profile, i.e., a kind of utility function, while the value of the candidate K is ill-known with respect to each criterion (once expert opinions have been fused). Viewing the problem in this way, it is natural to compute to what extent it is certain (or it is possible) that the candidate K (whose level assessment may be pervaded with imprecision and uncertainty for each criterion) satisfies the criteria at the required level; see [6] for an axiomatic view of the corresponding qualitative decision procedure. It corresponds to a fuzzy pattern matching problem [10] in practice.

For each criterion i, we build a satisfaction profile $\mu_i$, e.g., $\mu_i(s) = \neg\beta_i \vee s$. $\mu_i$ means that the greater the score, the better the candidate, and that the satisfaction degree is lower bounded by $\neg\beta_i$, which is all the greater as i is less important. Then the certainty that K satisfies the profile is given by

$$\bigwedge_i \bigwedge_s \; \mu_i(s) \vee \neg\pi_{c_i(K)}(s) \qquad (6)$$

where $\pi_{c_i(K)}$ is supposed to be normalized. The possibility that K satisfies the profile is

$$\bigwedge_i \bigvee_s \; \mu_i(s) \wedge \pi_{c_i(K)}(s) \qquad (7)$$

where $\pi_{c_i(K)}$ has been obtained by fusing the expert opinions first, at the level of each criterion. The possibility degree (very optimistic) should only be used for breaking ties in case of equality of the certainty degrees for different candidates. The aggregation by $\bigwedge_i$ of the elementary certainty and possibility degrees can be justified in the possibility framework (definition of a joint possibility distribution of non-interactive variables [15], and interpretation of the global requirement in terms of a weighted conjunction of elementary requirements pertaining to each criterion). The expressions (6) and (7) can be viewed as possibilistic “expectations”.
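Expressions (6) and (7) can be sketched as follows, under the hypothetical simplification that scores, importance levels and possibility degrees are all ranks 0, ..., 4 of one common scale (a commensurateness hypothesis made only for this illustration). The two criteria and their distributions are invented.

```python
TOP = 4  # common scale assumed for scores, importances and possibility degrees


def profile(beta, s):
    """Satisfaction profile: max of the reversed importance and the score."""
    return max(TOP - beta, s)


def certainty(pdfs, betas):
    """Expression (6): min over criteria and scores of max(profile, neg(pi))."""
    return min(
        min(max(profile(b, s), TOP - pdf[s]) for s in range(TOP + 1))
        for pdf, b in zip(pdfs, betas)
    )


def possibility(pdfs, betas):
    """Expression (7): min over criteria of max over scores of min(profile, pi)."""
    return min(
        max(min(profile(b, s), pdf[s]) for s in range(TOP + 1))
        for pdf, b in zip(pdfs, betas)
    )


# Hypothetical normalized p.d.f.'s: criterion 1 is fully important with score
# in {2, 3}; criterion 2 is not important at all with score in {1, 2}.
pi_1 = {0: 0, 1: 0, 2: 4, 3: 4, 4: 0}
pi_2 = {0: 0, 1: 4, 2: 4, 3: 0, 4: 0}
print(certainty([pi_1, pi_2], [4, 0]), possibility([pi_1, pi_2], [4, 0]))
```

The certainty degree is driven by the worst score that remains possible on the important criterion and the possibility degree by the best one, so the pair brackets the global evaluation; the unimportant criterion has no influence on either value.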

The result is then not so different from the spirit of the approach detailed before. Instead of obtaining a possibility distribution we obtain two scalar evaluations for which it should be possible to show that they summarize this possibility distribution. Anyway both approaches first compute the $\pi_{c_i(K)}$ and are based on the choice of a weighted conjunction.

5 Concluding remarks 

In this paper, we have taken advantage of a generic value assessment problem for 
discussing the different facets of the problem and raising the various difficulties 
and hypotheses which should be made at each step for computing a meaningful 
evaluation, as illustrated by the example. 

Although the statement of the problem contains no numbers, but only ordinal
assessments, a solution has been provided owing to the qualitative framework of
possibility theory. However note that it is important to consider the nature of
the scales which are used in order to know what operations are meaningful on
them. Moreover commensurateness hypotheses are necessary.

One important point in practice for this type of problem is also to be able to
provide explanations to the user on the result. The qualitative framework of
possibility theory allows for a logical reading and processing of the
evaluations. This offers a potential for explanation capabilities. Examples of
logical machineries handling such evaluations have already been proposed for
fusion and decision purposes [1, 9].

Didier Dubois, Michel Grabisch, and Henri Prade
Abstract. This paper proposes a method to learn from a set of examples
a theory expressed in default logic, more precisely in Lukaszewicz’ default
logic. The main characteristic of our method is to deal with theories
where the definitions of a predicate p and definitions for its negation ¬p
are explicitly and simultaneously learned. This method relies on classical
generalization techniques proposed in the field of Inductive Logic
Programming and on the notion of credulous/skeptical theorem in Default
Logic.



1 Introduction 

The first motivation of this work is based on two remarks: Machine Learning is typically a non monotonic process and Default Logic is a powerful non monotonic formalism intended to manage incomplete information. Moreover, we may have to induce theories that are inherently not completely specified and that must be represented in a formalism that enables non monotonic reasoning. So, it seems interesting to relate these two research domains. This paper addresses the problem of learning a default theory from a set of examples and an initial background knowledge. We are interested in explicitly learning the positive definition of a concept and its negative definition. To deal with negation, we want to avoid using the assumption “everything that is not known is false” (see section 3). We think that the treatment of negation and exceptions is a very important task for a machine learning system since, if one has something to learn, then one is faced with incomplete information and not completely specified knowledge.

The first two sections recall the major characteristics of default logic (section 2) and machine learning, especially in an inductive logic programming framework (section 3). The core of our work is presented in section 4, where our proposition to do machine learning in default logic is formally stated. In this section we also give an algorithm that makes it possible to effectively achieve our goal. Two examples will be presented that show that our approach is well suited for hierarchical defaults and also for more general defaults. Finally, section 5 compares our method to other works that propose to learn extended logic programs and gives some future lines to explore.
2 Default logic

Default logic was introduced by Reiter [22] in order to formalize common sense reasoning from incomplete information. Formally, it uses a set W of first order formulas representing the sure knowledge, and a set D of defaults representing some rules that are not completely specified. A default is an inference rule providing conclusions relying upon given, as well as absent, information. In this work we only use defaults of the form $\delta = \frac{\alpha : \beta}{\gamma}$, meaning as usual “if the prerequisite $\alpha$ is proved, and if the justification $\beta$ is individually consistent (in other words if nothing proves its negation) then one concludes the consequent $\gamma$”¹. The reader who is not familiar with default logic will find many complements about it in [3] and [24]. In this work we use the variant of default logic established by Lukaszewicz [14], also named justified default logic. We have chosen justified default logic because it solves the problem of the possible non-existence of extensions in Reiter’s default logic and because it satisfies the property of semi-monotonicity², allowing us to do default proofs in an easy way. Last, but not least, this default logic is one of the variants that stays closest to Reiter’s work. In this framework, an extension of a theory (W, D) is a pair of maximal sets of plausible conclusions and justifications obtained from (W, D) according to the formal definition below.

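Definition 1. Let $(W, D)$ be a default theory and let $E$ and $J$ be sets of formulas.
Define $E_0 = W$, $J_0 = \emptyset$ and for $i \geq 0$

$$E_{i+1} = Th(E_i) \cup \{\gamma \mid \tfrac{\alpha : \beta}{\gamma} \in D,\ \alpha \in E_i,\ \forall j \in J \cup \{\beta\},\ E \cup \{\gamma\} \nvdash \neg j\}$$

$$J_{i+1} = J_i \cup \{\beta \mid \tfrac{\alpha : \beta}{\gamma} \in D,\ \alpha \in E_i,\ \forall j \in J \cup \{\beta\},\ E \cup \{\gamma\} \nvdash \neg j\}$$

Then $(E, J)$ is an extension of $(W, D)$ iff $(E, J) = (\bigcup_{i=0}^{\infty} E_i, \bigcup_{i=0}^{\infty} J_i)$.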

In the sequel, since we are not interested in the resulting set of justifications
$J$, we shall use extension to designate the set of conclusions $E$.

In full generality, the prerequisite, justification and consequent of a default
can be any first order formulas. If one of them contains free variables, the
default is called open and, according to Reiter [22], it represents the set of all
closed defaults (without any free variables) obtained by instantiating each
free variable with all the constants of the domain. Some works ([20,1,18,6])
have studied open default theories, which is a very difficult problem, especially
in the presence of function symbols or existential quantifiers. That is why, in this
work, we use in $W$ only universally quantified formulas without function
symbols. As an example, the statements “students are typically not employed”,
“students are typically adults” and “adults are typically employed” can be
represented by the set of defaults $D = \{\frac{s(X) : \neg e(X)}{\neg e(X)}, \frac{s(X) : a(X)}{a(X)}, \frac{a(X) : e(X)}{e(X)}\}$.

¹ If $\delta = \frac{\alpha : \beta}{\gamma}$ is a default rule, $Prereq(\delta)$, $Justif(\delta)$ and $Conseq(\delta)$ respectively denote the
prerequisite, the justification and the consequent of $\delta$.

² A default logic is semi-monotonic if, for a set of formulas $W$ and two sets of defaults
$D$ and $D'$ such that $D \subseteq D'$, whenever $E$ is an extension of $(W, D)$ there is an
extension $E'$ of $(W, D')$ such that $E \subseteq E'$.
Suppose that $W = \{s(bob), a(kate)\}$. Then $(W, D)$ has two distinct extensions,
$E_1 = Th(W \cup \{\neg e(bob), a(bob), e(kate)\})$ and $E_2 = Th(W \cup \{e(bob), a(bob), e(kate)\})$,
representing the two possible points of view for bob in the absence of other
information.

Default logic is concerned not only with extensions but also with theorems and
proof theory. Since a default theory may have multiple extensions, we have to
make the notion of theorem precise. Given a default theory $(W, D)$, a formula $\varphi$ is
a credulous theorem if it belongs to at least one extension of $(W, D)$, and it is
a skeptical theorem if it belongs to every extension of $(W, D)$. In our example,
$e(kate)$ is a skeptical theorem and $e(bob)$ is a credulous one.
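
The two notions can be checked mechanically on the student example. The following Python sketch is only an illustration and is not part of the method described here. It assumes a ground theory in which $W$, prerequisites and consequents are literals and each justification is a conjunction of literals, so that $Th$ adds nothing beyond the literals themselves. It also assumes that the three statements are encoded as normal defaults, and it enumerates extensions by exhaustively trying every order of application. All function names are hypothetical.

```python
def neg(lit):
    """Return the complementary literal; a leading '-' marks negation."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def applicable(default, E, J):
    """Applicability test of justified default logic, restricted to literals."""
    pre, just, cons = default
    if pre not in E:
        return False
    new_E = E | {cons}
    # E with the new consequent must itself be consistent ...
    if any(neg(l) in new_E for l in new_E):
        return False
    # ... and must refute neither the new justification nor any earlier one.
    return all(neg(l) not in new_E for j in J | {just} for l in j)


def extensions(W, D):
    """Enumerate extensions by exhaustive search over application orders."""
    found, seen = set(), set()

    def grow(E, J, used):
        if (E, J, used) in seen:
            return
        seen.add((E, J, used))
        todo = [d for d in D if d not in used and applicable(d, E, J)]
        if not todo:  # nothing more can be applied: E is maximal
            found.add(E)
        for pre, just, cons in todo:
            grow(E | {cons}, J | {just}, used | {(pre, just, cons)})

    grow(frozenset(W), frozenset(), frozenset())
    return found


def credulous(lit, exts):
    return any(lit in E for E in exts)


def skeptical(lit, exts):
    return all(lit in E for E in exts)


# Ground instances of the three defaults for the two constants of the example.
D = []
for x in ["bob", "kate"]:
    D.append((f"s({x})", frozenset({f"-e({x})"}), f"-e({x})"))  # students: not employed
    D.append((f"s({x})", frozenset({f"a({x})"}), f"a({x})"))    # students: adults
    D.append((f"a({x})", frozenset({f"e({x})"}), f"e({x})"))    # adults: employed

exts = extensions({"s(bob)", "a(kate)"}, D)
```

With these assumptions, `skeptical("e(kate)", exts)` and `credulous("e(bob)", exts)` both hold, while `skeptical("e(bob)", exts)` does not. The exhaustive search is exponential and is meant only for toy theories of this size.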

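Definition 2. Let $(W, D)$ be a default theory and $\varphi$ a formula. The sequence of
defaults $(\delta_1, \ldots, \delta_n)$ is a default proof of $\varphi$ in $(W, D)$ iff

- $W \cup Conseq(\{\delta_1, \ldots, \delta_n\}) \vdash \varphi$

- $\forall i \in \{1, \ldots, n\},\ W \cup Conseq(\{\delta_1, \ldots, \delta_{i-1}\}) \vdash Prereq(\delta_i)$

- $\forall \beta \in Justif(\{\delta_1, \ldots, \delta_n\}),\ W \cup Conseq(\{\delta_1, \ldots, \delta_n\}) \nvdash \neg\beta$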

It has been shown in [14] that every credulous theorem has a default proof, and 
some operational systems ([5, 25]) are available to do query-answering. Last, 
some works ([27,26]) are interested in skeptical default proof, that is default 
proof for skeptical theorems. 
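
As an illustration of what a default proof requires, the sketch below checks a candidate sequence of ground defaults. It is a simplification, not the proof procedure of [14] or of the systems cited above: it assumes that $W$ is a consistent set of literals and that every prerequisite, justification conjunct and consequent is a literal, so that derivability reduces to set membership. The three tests follow the usual reading of a default proof: each prerequisite is derived before its default is used, the consequents together with $W$ refute no justification, and the goal is derived. The name `is_default_proof` is hypothetical.

```python
def neg(lit):
    """Return the complementary literal; a leading '-' marks negation."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def is_default_proof(W, seq, phi):
    """seq is a list of (prerequisite, justification_literals, consequent)."""
    derived = set(W)
    for pre, _just, cons in seq:
        # Each prerequisite must follow from W and the earlier consequents.
        if pre not in derived:
            return False
        derived.add(cons)
    # W plus all consequents must be consistent ...
    if any(neg(l) in derived for l in derived):
        return False
    # ... must not refute any justification of the sequence ...
    if any(neg(l) in derived for _pre, just, _cons in seq for l in just):
        return False
    # ... and must contain the goal.
    return phi in derived


W = {"s(bob)"}
d_adult = ("s(bob)", {"a(bob)"}, "a(bob)")
d_employed = ("a(bob)", {"e(bob)"}, "e(bob)")
d_not_employed = ("s(bob)", {"-e(bob)"}, "-e(bob)")

ok = is_default_proof(W, [d_adult, d_employed], "e(bob)")
wrong_order = is_default_proof(W, [d_employed, d_adult], "e(bob)")
clash = is_default_proof(W, [d_adult, d_employed, d_not_employed], "e(bob)")
```

Here `ok` is true, `wrong_order` is false because the prerequisite `a(bob)` is not yet available, and `clash` is false because the consequents contradict each other.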

Let us close this section by noting that default logic semantics is based on the
idea that each default used to build an extension is a tool to constrain the
set of models of this extension. Formally, an extension of $(W, D)$ that is based on a
generating default set $\Delta$ ($\Delta$ is a subset of $D$ such that $E = Th(W \cup Conseq(\Delta))$)
admits the set of models $M$ such that $\forall \delta \in \Delta$, we have $\forall m \in M, m \models Prereq(\delta) \wedge Conseq(\delta)$
and $\exists m \in M, m \models Justif(\delta)$.

3 Inductive learning and non monotonic formalisms 

The learning method that we shall present in section 4 is closely related to 
Inductive Logic Programming (ILP), the domain where machine learning meets 
logic programming. The aim of ILP is to induce a theory expressed by clauses 
from positive and negative examples while using background knowledge. 

A learning paradigm common to most works in ILP to learn the definition of a
predicate $p$ is the following: given an initial background knowledge $B$ expressed
by a logic program, a set $E^+$ of positive examples and a set $E^-$ of negative
examples, find a set of clauses $H$ such that $\forall e \in E^+, B \wedge H \models p(e)$ and
$\forall e \in E^-, B \wedge H \not\models p(e)$. The basic mechanism is to construct clauses that cover the positive
examples and do not cover any negative example. An example $e$ is said to be covered
by a set of clauses $H$ if $B \wedge H \models p(e)$. Different methods have been proposed
to construct the set of clauses $H$ (see [16] for a review of algorithms and
systems for ILP). The problem of learning definitions for $p(X)$ can be considered
as a search through the space of possible clauses whose head is $p(X)$. Top-down
methods begin with the most general clause (p(X) :- true.) and iteratively add
to the body literals that cover a large number of positive examples and no
negative examples. The other possibility is to work bottom-up by successively
generalizing the examples.
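
To make the covering conditions concrete, here is a minimal Python sketch. It assumes a deliberately tiny setting that is much weaker than real ILP systems: the background knowledge is a set of ground unary facts, and a hypothesis is a list of clause bodies for the head $p(X)$, each body being a list of predicate names. The greedy top-down refinement is a naive stand-in for the search strategies surveyed in [16], and all names and data are hypothetical.

```python
def covers(body, e, facts):
    """p(X) :- body covers e when every body literal holds for e."""
    return all((pred, e) in facts for pred in body)


def hypothesis_covers(H, e, facts):
    return any(covers(body, e, facts) for body in H)


def is_solution(H, pos, neg, facts):
    """Every positive example is covered and no negative example is."""
    return (all(hypothesis_covers(H, e, facts) for e in pos)
            and not any(hypothesis_covers(H, e, facts) for e in neg))


def refine_top_down(pos, neg, facts, predicates):
    """Start from p(X) :- true and add literals until no negative is covered.

    Returns None if the predicates run out first. The result is not
    guaranteed to cover every positive example.
    """
    body = []
    while any(covers(body, e, facts) for e in neg):
        candidates = [p for p in predicates if p not in body]
        if not candidates:
            return None

        def score(p):
            new = body + [p]
            return (sum(covers(new, e, facts) for e in pos)
                    - sum(covers(new, e, facts) for e in neg))

        body.append(max(candidates, key=score))
    return body


facts = {("bird", "tweety"), ("sparrow", "tweety"),
         ("bird", "opus"), ("pen", "opus")}
pos, neg = ["tweety"], ["opus"]
learned = refine_top_down(pos, neg, facts, ["bird", "pen", "sparrow"])
```

On this toy data the clause with body `bird` covers the negative example, so it is not a solution, whereas the refinement returns a body that excludes it.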

Many systems have proved their efficiency for learning definite logic programs 
(where no form of negation is allowed) [16]. The ILP community has also
considered the problem of using more expressive formalisms, especially by dealing with
negation. Learning normal logic programs (where the clause body may contain
the Negation as Failure operator) is considered in [2, 10, 15]. Normal logic pro- 
grams provide negative information implicitly, because a negative literal cannot 
appear in the head of a clause. This implicit treatment of negation is based on 
the closed world assumption which assumes that everything that is not explicitly 
stated is false. 

It is interesting here to recall the difficulties raised by negation and
uncertain information for machine learning [21]. In machine learning we can learn
from positive and negative examples a definition for a predicate $p$. Then we can
apply the closed world assumption and consider that everything that is not
recognized by this definition of $p$ is an instance of $\neg p$. This is illustrated in figure 1
(from [21]). Let us imagine an autonomous agent that, for example, learns from
experience when it is relevant to apply a certain action. Positive experiences
mean that it is relevant to apply this action and negative experiences represent
situations where this action is not appropriate. If this agent learns a description
of relevant situations for this action and uses the closed world assumption, it
will deal with the universe of possible situations according to schema a or b
in figure 1. Figure 1.a represents a cautious approach where the agent learns
from specific to general and considers as relevant only situations very close to
the positive examples. So, before it has encountered sufficient experiences, it
cannot distinguish bad situations from unknown situations (the situations above
the line). In figure 1.b, the agent learns from general to specific and has a
riskier approach: it cannot distinguish a relevant situation from an unknown
situation. But [21] pointed out that the closed world assumption is not relevant
in a machine learning situation: “If everything is known, then it is not necessary
to learn”. During the dynamic process of learning it is more realistic to
distinguish clearly what is true, what is false and what is unknown, as illustrated
in figure 1.c. Moreover, this differentiation is also necessary when we have to deal
with theories that are inherently incomplete, that is, where it is difficult to
express a complete specification of a concept. This is the aim of our present work,
in which we want to induce a definition for $p$ and a definition for $\neg p$ (conditions
under which $p$ is false).

Fig. 1. Three different points of view for learning from positive and negative examples.

Of course, this approach does not mean applying the same learning
algorithm twice, once to learn $p$ and once to learn $\neg p$, because the definitions for $p$
and $\neg p$ may overlap on some instances. Thus we must adopt a representation
language and a semantics that make it possible to resolve such conflicts. We shall discuss in
section 5 some representations that have been chosen to deal with this problem.
In this work, the target theory is expressed in default logic and we propose an
algorithm to learn explicit definitions for $p$ and $\neg p$ in this framework. Default
logic is a powerful language to represent incomplete knowledge, which enables
our method to obtain compact theories where the relationships between the
definitions for $p$ and $\neg p$ appear clearly.



4 Learning Lukaszewicz’ Default Theories

The following definition makes precise the framework of learning a default
theory for a single concept represented by a predicate $p$. In this new formulation,
we adapt the classical definition coming from ILP by taking into account the two
main features of our work: the use of default logic and the induction of explicit
definitions for $p$ and for $\neg p$.

Definition 3. Given a set of positive examples $E^+ = \{e_1, e_2, \ldots, e_n\}$ of the
predicate $p$ ($p(e)$ is true for all $e \in E^+$), a set of negative examples $E^- =
\{e'_1, e'_2, \ldots, e'_m\}$ ($\neg p(e)$ is true for all $e \in E^-$) and an initial consistent set of first
order formulas $T$ containing no occurrence of $p$, learning a default theory consists
in finding a default set $D$ such that $(\bigwedge_{e \in E^+} p(e)) \wedge (\bigwedge_{e \in E^-} \neg p(e))$ is a skeptical
theorem of $(T, D)$.

We also say that an example $e \in E^+$ (resp. $E^-$) is covered by $(T, D)$ if $p(e)$
(resp. $\neg p(e)$) is a credulous theorem of $(T, D)$.

Let us recall that a default theory may have multiple, mutually inconsistent
extensions. So it is not enough to build a default theory covering each
example, because it would be possible to obtain an extension in which some
examples are not qualified. For instance, with $E^+ = \{1\}$, $E^- = \{2\}$ and
$T = \{bird(1), pen(2), pen(X) \rightarrow bird(X)\}$ we can build the simple default
set $D = \{\frac{bird(X) : flies(X)}{flies(X)}, \frac{pen(X) : \neg flies(X)}{\neg flies(X)}\}$ to learn the concept “flies”. But
this theory $(T, D)$ has two extensions, $E_1 = Th(T \cup \{flies(1), flies(2)\})$ and
$E_2 = Th(T \cup \{flies(1), \neg flies(2)\})$. We see that $E_1$ does not cover the
negative example 2: $E_1$ has a set of models, and none of them satisfies
$\neg flies(2)$, as is required since 2 is a negative example.

That is why our definition requires that all the examples be skeptical
theorems of the induced default theory. This ensures that they hold in
every model of every extension.

Our method is based on a “classical” ILP algorithm named Learn(in: $p$, $\mathcal{E}$, $T$;
out: $\varphi$) that induces one definition $\varphi(X)$ from a set of positive examples $\mathcal{E}$ and
a background theory $T$. We can use, for example, the relative least general generalization
[19], used in the system Golem [17]. It is beyond the scope of this paper to
discuss the use of other inductive learning algorithms. Our general default learning
algorithm DefLearn(in: $p$, $T$, $E^+$, $E^-$; out: $D$), where $p$, $T$, $E^+$, $E^-$ are as
described in definition 3, is based on two fundamental steps.

generalization step: We induce a formula $\varphi$ that generalizes some positive
examples and we build the default $\frac{\varphi(X) : p(X)}{p(X)}$. This rule may be too general and
admit exceptions. In our framework, an exception is an example $e \in E^-$ such
that $p(e)$ has a default proof containing this default. Then we have to modify the default,
since we want $\neg p(e)$ to be a skeptical theorem. This modification is called
specialization. The generalization step is repeated until all examples are covered.

specialization step: We induce a formula $\psi$ that generalizes a set of exceptions
$Exc$ and we modify the previous rule as follows: $\frac{\varphi(X) : p(X) \wedge \neg\psi(X)}{p(X)}$. In this way,
the default is no longer applicable for any $e \in Exc$, since such an $e$ satisfies $\psi$,
and then $p(e)$ is not a credulous theorem. Moreover, in order to cover the
exceptions (which are negative examples for $p$) we build another default, $\frac{\psi(X) : \neg p(X)}{\neg p(X)}$,
which we shall have to specialize in turn if it admits exceptions.

More formally, we give in figure 2 our general algorithm to learn a default
theory. An important point to note is that our method alternately learns $p$ and
$\neg p$. That is why our procedures use a formal parameter $q$ that will be instantiated
by $p$ or $\neg p$ depending on whether we generalize/specialize $p$ or $\neg p$. In fact,
specializing a rule for $p$ that is too general corresponds to learning a general rule for $\neg p$. If
we try to learn $\neg p$ then the roles of the example sets ($E^+$, $E^-$) are switched: $E^-$
becomes the set of positive examples for $\neg p$ and $E^+$ becomes the set of negative
examples for $\neg p$.

When all the positive examples are covered, we check whether there are still
some negative examples not covered by the current theory. If necessary,
we begin to complete the definition of $\neg p$ by a similar process. All the new
defaults that are introduced for $\neg p$ are constrained by a justification that is the
conjunction of the negations of the prerequisites of all defaults concluding $p$. In fact, here we
profit from the work done to induce general rules for $p$ because, at the same time,
these rules characterize the possible exceptions to general rules for $\neg p$. In
this way, we are sure that it is not possible to obtain a default proof of $p(e)$ for
one of the examples $e \in E^-$ that we are now treating.

Example 1. The sets of positive and negative examples are $E^+ = \{3, 4, 5, 10, 11\}$
and $E^- = \{1, 2, 6, 7, 8, 9\}$. At the beginning $D$ is empty, the theory³ is

$$T = \{pen(1), pen(2), bird(3), bird(4), bird(5), mam(6), mam(7), mam(8), mam(9), bat(10), superpen(11),\ pen(X) \rightarrow bird(X),\ superpen(X) \rightarrow pen(X),\ bat(X) \rightarrow mam(X)\}$$

and we begin to learn $flies(X)$. The generalization process returns $bird(X)$, which
covers the set of examples $\{3, 4, 5, 11\}$. So the default $\delta_1^0 = \frac{bird(X) : flies(X)}{flies(X)}$ is
added to $D$. Next we determine the “exceptions to that default”, that is, the
elements of $E^-$ for which $flies(X)$ can be proved. The exceptions are $\{1, 2\}$, and

³ pen stands for penguin and mam for mammal.
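DefLearn(in: $q$, $T$, $E^+$, $E^-$; out: $D$)
begin
    $D \leftarrow \emptyset$
    $Justif \leftarrow true$
    while some examples in $E^+$ are not covered do
        $E \leftarrow \{e \in E^+ \mid q(e)$ is not a credulous theorem of $(T, D)\}$
        Learn($q$, $E$, $T$, $\varphi$)
        $\delta \leftarrow \frac{\varphi(X) : q(X)}{q(X)}$
        $D \leftarrow D \cup \{\delta\}$
        $E_q^- \leftarrow \{e \in E^- \mid q(e)$ is a credulous theorem of $(T, D)\}$
        // $E_q^-$ is the set of exceptions of the rule $\delta$
        if $E_q^- \neq \emptyset$
        then // specialization step
            Specialize($q$, $\delta$, $E_q^-$, $T$, $E^-$, $E^+$)
        endif
    endwhile
    $Justif \leftarrow \bigwedge \neg pre(\delta)$, for all $\delta \in D$ s.t. $cons(\delta) = q(X)$
    while there are some examples in $E^-$ not covered do
        $E \leftarrow \{e \in E^- \mid \neg q(e)$ is not a credulous theorem of $(T, D)\}$
        Learn($\neg q$, $E$, $T$, $\varphi$)
        Remove from $Justif = \bigwedge_{i=1}^{k} \neg J_i(X)$ all $J_i(X)$ such that there is no $e \in E^+$
            verifying $T \models \varphi(e) \wedge J_i(e)$
        $D \leftarrow D \cup \{\frac{\varphi(X) : \neg q(X) \wedge Justif}{\neg q(X)}\}$
    endwhile
end

Specialize(in: $q$, $\delta$, $Exc$, $T$, $E_{neg}$, $E_{pos}$)
begin
    while $Exc \neq \emptyset$ do
        Learn($\neg q$, $Exc$, $T$, $\varphi_{Exc}$)
        $jus(\delta) \leftarrow jus(\delta) \wedge \neg\varphi_{Exc}(X)$
        $\delta_{Exc} \leftarrow \frac{\varphi_{Exc}(X) : \neg q(X)}{\neg q(X)}$
        $D \leftarrow D \cup \{\delta_{Exc}\}$
        $ExcExc \leftarrow \{e \in E_{pos} \mid \neg q(e)$ is a credulous theorem of $(T, D)\}$
        // $ExcExc$ is the set of exceptions of exceptions
        if $ExcExc \neq \emptyset$
        then
            Specialize($\neg q$, $\delta_{Exc}$, $ExcExc$, $T$, $E_{pos}$, $E_{neg}$)
        endif
        $Exc \leftarrow Exc \setminus \{e \in Exc \mid T \vdash \varphi_{Exc}(e)\}$
    endwhile
    $D \leftarrow D \cup \{\delta\}$
end

Fig. 2. Default theories learning algorithms.
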
the call to Specialize modifies $\delta_1^0$ and builds a general default to prove that these
elements do not fly. In fact, Learn returns the formula $pen(X)$ that generalizes
these exceptions. So $\delta_1^0$ is transformed into $\delta_1^1 = \frac{bird(X) : flies(X) \wedge \neg pen(X)}{flies(X)}$ and
$\delta_2^0 = \frac{pen(X) : \neg flies(X)}{\neg flies(X)}$ is added to $D$. Now, we determine that $\{11\}$ is an exception
to $\delta_2^0$ and we have to specialize $\delta_2^0$. The same process of generalization applied
to $\{11\}$ leads to the formula $superpen(X)$, and then $\delta_2^0$ is modified into $\delta_2^1 =
\frac{pen(X) : \neg flies(X) \wedge \neg superpen(X)}{\neg flies(X)}$ and $\delta_3^0 = \frac{superpen(X) : flies(X)}{flies(X)}$ is added to $D$. We
do not detect exceptions to $\delta_3^0$; consequently the recursive calls to Specialize
end. As there are some positive examples not covered by our theory $(T, D)$,
namely $\{10\}$, the process of learning defaults for flies goes on. The most specific
generalization of $\{10\}$ leads us to add to $D$ the default $\delta_4^0 = \frac{bat(X) : flies(X)}{flies(X)}$, which
has no exceptions. At this point, all the positive examples for flies can be proved from
$(T, D)$ and we have to consider the negative examples in order to complete our
definition of $\neg flies$. The examples of $E^-$ not covered by $(T, D)$ are $\{6, 7, 8, 9\}$
and the call to Learn returns $mam(X)$. For this last step, we have recorded
in $Justif$ the negations of the prerequisites of all defaults concluding $flies(X)$. $Justif$ is
used as the justification for the new default that we are building to complete the
definition of $\neg flies(X)$. But we remove from $Justif$ the “sub-justifications” that
are not relevant for the prerequisite $mam(X)$. It is not necessary, and not desired
as will be illustrated in example 2, to keep for instance the justification $\neg bird(X)$,
since in the set $E^+$ there is no bird that is a mammal. Then we obtain $\delta_5^0 =
\frac{mam(X) : \neg flies(X) \wedge \neg bat(X)}{\neg flies(X)}$. To summarize, the default set $D$ that has been learned is

$$\left\{ \frac{bird(X) : flies(X) \wedge \neg pen(X)}{flies(X)},\ \frac{pen(X) : \neg flies(X) \wedge \neg superpen(X)}{\neg flies(X)},\ \frac{superpen(X) : flies(X)}{flies(X)},\ \frac{bat(X) : flies(X)}{flies(X)},\ \frac{mam(X) : \neg flies(X) \wedge \neg bat(X)}{\neg flies(X)} \right\}$$

and the conjunction $(\bigwedge_{e \in E^+} flies(e)) \wedge (\bigwedge_{e \in E^-} \neg flies(e))$ is a skeptical theorem
of $(T, D)$.
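
The claim that every example of Example 1 is a skeptical theorem can be checked by brute force. The sketch below is an illustration under stated assumptions, not the authors' implementation: since the five learned defaults only relate properties of the same individual, it treats each individual separately and propositionally, closes its facts under the three implications of $T$, and tries every order of application of the defaults with the applicability test of justified default logic restricted to literals. The names `run`, `closure` and `extensions_for` are hypothetical.

```python
from itertools import permutations


def neg(lit):
    """Return the complementary literal; a leading '-' marks negation."""
    return lit[1:] if lit.startswith("-") else "-" + lit


RULES = [("pen", "bird"), ("superpen", "pen"), ("bat", "mam")]
FACTS = {1: "pen", 2: "pen", 3: "bird", 4: "bird", 5: "bird", 6: "mam",
         7: "mam", 8: "mam", 9: "mam", 10: "bat", 11: "superpen"}
# (prerequisite, justification literals, consequent) for the learned set D.
DEFAULTS = [
    ("bird", {"flies", "-pen"}, "flies"),
    ("pen", {"-flies", "-superpen"}, "-flies"),
    ("superpen", {"flies"}, "flies"),
    ("bat", {"flies"}, "flies"),
    ("mam", {"-flies", "-bat"}, "-flies"),
]


def closure(prop):
    """Properties of an individual after applying the implications of T."""
    props, changed = {prop}, True
    while changed:
        changed = False
        for body, head in RULES:
            if body in props and head not in props:
                props.add(head)
                changed = True
    return props


def run(order, facts):
    """Apply defaults in the given priority order until nothing changes."""
    E, J, used, progress = set(facts), set(), set(), True
    while progress:
        progress = False
        for i, (pre, just, cons) in enumerate(order):
            if i in used or pre not in E:
                continue
            new_E = E | {cons}
            # new_E must be consistent and refute no justification literal.
            if all(neg(l) not in new_E for l in new_E | J | just):
                E, J, progress = new_E, J | just, True
                used.add(i)
    return frozenset(E)


def extensions_for(individual):
    facts = closure(FACTS[individual])
    return {run(order, facts) for order in permutations(DEFAULTS)}


E_POS, E_NEG = [3, 4, 5, 10, 11], [1, 2, 6, 7, 8, 9]
all_pos = all("flies" in E for e in E_POS for E in extensions_for(e))
all_neg = all("-flies" in E for e in E_NEG for E in extensions_for(e))
```

Under these assumptions both `all_pos` and `all_neg` hold, and each individual has a single extension. The enumeration over permutations is only practical because there are five defaults.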

We now give some arguments that support the correctness of our
algorithm. First, if the algorithm Learn is always able to furnish a formula $\varphi(X)$
describing the set of the given examples⁴, then it is obvious that the number
of examples not covered decreases strictly at each iteration. Thus, DefLearn will
always terminate. Secondly, our process to construct defaults guarantees that
for each example $e \in E^+$ (resp. $e \in E^-$) there exists a default proof of $p(e)$ (resp.
$\neg p(e)$) and no default proof for $\neg p(e)$ (resp. $p(e)$). As we have supposed that $T$ is a
consistent set not containing any occurrence of $p$, it is obvious that $(\bigwedge_{e \in E^+} p(e)) \wedge
(\bigwedge_{e \in E^-} \neg p(e))$ is a credulous theorem of $(T, D)$ because we can group together
all the individual default proofs without creating an inconsistency. Now, let us
suppose that this formula is not a skeptical theorem. It means that there exists
an extension in which there is no default proof for a particular $p(e), e \in E^+$
(or $\neg p(e), e \in E^-$). The only way to have this situation is to block all possible
default proofs of one $p(e), e \in E^+$ (or $\neg p(e), e \in E^-$). That is only possible
by a default proof for $\neg p(e), e \in E^+$ (or $p(e), e \in E^-$), since it is impossible
to have a default proof for $\neg\psi(e)$ (the negation of one term different from $p(e)$

⁴ See section 5 for details about the case where Learn fails.







Beatrice Duval and Pascal Nicolas 



in the justification), because defaults conclude only on $p(X)$ or $\neg p(X)$. But as
we have said above, our algorithm stops when there is no default proof for
$\neg p(e), e \in E^+$ and for $p(e), e \in E^-$. So we have a contradiction, and the formula
$(\bigwedge_{e \in E^+} p(e)) \wedge (\bigwedge_{e \in E^-} \neg p(e))$ is a skeptical theorem of the learned theory.

Example 2. We give here two versions of a well known example in order to
illustrate that a learned default theory may have multiple
extensions. Firstly, let us suppose that we want to learn the predicate $p$ with $T =
\{q(bob), q(nixon), r(georges), r(nixon)\}$⁵, $E^+ = \{bob\}$ and $E^- = \{georges\}$.
DefLearn will produce the default set $D = \{\frac{q(X) : p(X)}{p(X)}, \frac{r(X) : \neg p(X)}{\neg p(X)}\}$. As
required, $p(bob) \wedge \neg p(georges)$ is a skeptical theorem of $(T, D)$ even if $(T, D)$
has two distinct extensions $E_1 = Th(T \cup \{p(bob), \neg p(georges), p(nixon)\})$ and
$E_2 = Th(T \cup \{p(bob), \neg p(georges), \neg p(nixon)\})$. But that is not surprising, since
nixon is not a classified example.

Secondly, if we consider $E^+ = \{bob, nixon\}$ and $E^- = \{georges\}$ then our
algorithm produces $D = \{\frac{q(X) : p(X)}{p(X)}, \frac{r(X) : \neg p(X) \wedge \neg q(X)}{\neg p(X)}\}$. In this case, there is
only one extension, in which nixon is a pacifist, in accordance with the examples. If
we begin the treatment of this example by learning $\neg p(X)$, the second part of
the algorithm, which has to learn $p(X)$, has nothing to do since all positive
examples are already covered by the default built while learning the exceptions
to $\neg p(X)$.



5 Related works and perspectives 

The main contribution of our work is to give a framework and a method to learn 
a concept and its negation while easily dealing with exceptions of general rules. 
In this section, we examine other works in the same field. 

In [21] a concept and its negation are effectively learned, but in the framework
of definite clauses: the negative concept is represented by a new predicate not_p
and the learning algorithm checks that no contradiction occurs between the
definitions of p and not_p.

The framework proposed in [8] is able to learn a concept and its exceptions
by means of general rules like

$flies(X) \leftarrow bird(X)$
$\neg flies(X) \leftarrow penguin(X)$
$flies(X) \leftarrow superpenguin(X)$

It uses additional priority relations in order to resolve conflicts between rules.
For instance, in the above example, the third rule is given a higher priority
than the second one, which itself has a higher priority than the first one. In fact,
using these relations and an adequate semantics, the authors capture the notion
of specificity of a rule, as is done in [4] in prioritized default logic. But it is
known ([23, 9]) that specificity can be handled by means of semi-normal defaults⁶,
and that is exactly what our method does.

⁵ r stands for republican, q for quaker and p for pacifist.

⁶ A default is semi-normal if it has the form $\frac{\alpha : \beta \wedge \gamma}{\gamma}$.

In [7] the problem of contradiction between the definitions of $p$ and $\neg p$ is solved
by using integrity constraints in order to restrict the conclusions derivable from
rules that are too general.

More recently, some works deal with this problem in the context of extended
logic programs [12, 13]. Extended Logic Programs (ELP) were introduced
by Gelfond and Lifschitz [11] to extend the class of normal logic programs by
allowing explicit negation. A rule in an ELP has the form $L_0 \leftarrow L_1, \ldots, L_m, not\ L_{m+1}, \ldots, not\ L_n$,
where each $L_i$ is a literal (positive or negative). [12, 13]
propose methods to learn an ELP that contains a definition of $p$ and a definition
of $\neg p$. Each definition may have exceptions that are described by abnormality
predicates, and these abnormality predicates are defined by normal clauses. So
the aim of these works is the same as ours. The main difference is that we do not
rely on abnormality predicates to specialize overgeneral rules. For instance, the
algorithm presented in [13] learns rules for $p$ and specializes them if they have
exceptions, then it computes in the same manner a set of rules for $\neg p$. For our
example 1, the following rules

flies(X) :- bird(X), not ab1(X).        ¬flies(X) :- mam(X), not ab3(X).
ab1(X) :- pen(X), not ab2(X).           ¬flies(X) :- pen(X), not ab4(X).
ab2(X) :- superpen(X).                  ab4(X) :- superpen(X).
flies(X) :- bat(X).                     ab3(X) :- bat(X).

are learned. We can observe that the algorithm has dealt twice with the set of
penguins, once when penguins are considered as a characterization of ab1 and
another time when penguins are considered as examples of the concept ¬flies.
This illustrates that using abnormality predicates to specialize rules hides the
deep relationships that exist between the definitions of $p$ and $\neg p$, and this leads to
redundancy in the resulting rules. We have not given here the complete induced
theory, where in fact the first rule is transformed into the two rules

flies(X) :- bird(X), not ab1(X), not ¬flies(X).
flies(X) :- bird(X), undefined(¬flies(X)).

and similarly for the other rules concluding flies(X) or ¬flies(X). The well
founded semantics requires these modifications in order to deal correctly with
the examples where the definitions of flies and ¬flies overlap.

The algorithm presented in [12] relies on the ratio of positive examples to
decide whether it should learn $p$ or $\neg p$. When the numbers of positive and negative
examples are close, the algorithm learns in parallel rules for $p$ and rules for $\neg p$, as
is the case in [13]. Our method, which alternately learns general definitions for
$p$ and $\neg p$, gives a more complex algorithm that tries to avoid several executions
of the Learn procedure on the same set of data.

The work presented in this paper must be continued in several directions.
Our algorithm could be modified to take into account some particular examples
that cannot be generalized. For instance, if in example 1 we add $bird(12)$ to $T$
and 12 to $E^-$, then 12 is an exception to the default with prerequisite $bird(X)$ that cannot be described by a
formula different from $bird(X)$. So it is impossible to specialize the default theory
to correctly cover this example. In this case, a solution is simply to add the
example $\neg flies(12)$ as a fact to $T$ in the resulting theory. This default
is then blocked for 12 and $\neg flies(12)$ is a skeptical theorem, as required. Also, we have
considered here that the initial theory does not contain defaults. The framework
presented, as well as the algorithm, can be extended to learn a single concept
represented by a predicate $p$ while using an initial theory containing defaults
that do not define $p$. As our notion of covering is already based on default
proofs, this does not involve major changes.

6 Conclusion

We have presented a method to induce Lukaszewicz’ default theories from a set of
positive and negative examples. Our algorithm uses a classical ILP algorithm in
order to propose formulas that generalize a set of instances. Our method learns
at the same time definitions for a concept and its negation by alternating
generalization steps and specialization steps. The mechanism proposed here could
be integrated in a system that helps a user to formalize domain knowledge
in a default theory. Works in machine learning and knowledge acquisition have
shown that it is easier for an expert to propose examples of something
than to write rules directly. The framework we have proposed may
be helpful to detect exceptions to rules and to handle and correct contradictory
rules.

Abstract. Traditionally, inductive learning algorithms such as decision
tree learners have employed attribute-value representations, which are
essentially propositional. While learning in first-order logic has been
studied for almost 20 years, this has mostly resulted in completely new
learning algorithms rather than first-order upgrades of propositional learning
algorithms. To re-establish the link between propositional and first-order
learning, we have to focus on individual-centered representations. This
short paper is devoted to the nature of first-order individual-centered
representations for inductive learning. I discuss three possible
perspectives: representing individuals as Herbrand interpretations, representing
datasets as an individual-centered database, and representing individuals
as terms.



1 Introduction 

Inductive learning can be loosely defined as learning general rules from specific
examples. In concept learning, in particular, examples are descriptions of
instances and non-instances of the concept to be learned, and induced rules take
the form of classification rules. Here, one could say that induction generalises
from statements about individuals to statements about sets of individuals. Indi- 
viduals are the units over which generalisation takes place, so rules typically 
do not contain references to individuals. Other inductive learning tasks, such as 
program synthesis from examples or knowledge discovery in databases, assume 
a format where the notion of individual plays a less prominent role. 

It is well-understood that knowledge representation is of crucial importance 
to machine learning: what cannot be expressed easily cannot be learned easily. 
Initially, work on inductive concept learning focused almost exclusively on so- 
called attribute-value (AV) representations, where an attribute is a function that 
assigns to each individual a value from a pre-specified domain. As explained be- 
low, in a certain sense attribute-value representations can be called propositional. 
With Prolog’s rising popularity in the early 1980’s, people started investigating 
the use of first-order definite clause logic for inductive learning, leading to the 
establishment of what is now called inductive logic programming (ILP).

As first-order logic, by itself, does not enforce a notion of individual, this
development has led to a considerable gap between propositional concept learn- 
ing and ILP. Recently, researchers have started to fill this gap by investigating 
first-order individual-centered representations. This is useful, because it demon- 
strates how ILP extends propositional concept learning, and characterises the 
added complexity. It also gives a clear perspective on how propositional concept 
learning algorithms can be upgraded to the first-order case. 

In this short paper I discuss the nature of first-order individual-centered rep- 
resentations for inductive learning. The paper is structured as follows. In Section 
2 I discuss attribute-value learning, and explain why it is often called proposi- 
tional learning. Section 3 presents three ways of generalising this to first-order 
(and possibly higher-order) learning: by representing individuals as Herbrand in- 
terpretations, by representing datasets as an individual-centered database, and 
by representing individuals as terms. Section 4 concludes. 

2 Propositional Learning 

Intuitively, an attribute-value dataset (i.e. a set of classified examples) is given
by a table, the columns of which are attributes, and the rows are descriptions of
instances in terms of those attributes. One attribute is designated as the class
attribute, and the learning task is to construct a classifier that can predict the value
of the class attribute from the values of the other attributes. In symbolic
concept learning, predictions are derived from explicitly generated classification rules
of the form IF shape = round AND size = medium THEN class = football
(following logic programming terminology the IF-part is also called the body,
and the THEN-part is called the head). First I give a brief formalisation of this
representation, then discuss it from the knowledge representation perspective.

An AV-signature is a finite set of attributes $\{A_1, \ldots, A_n\}$. Associated with
each attribute $A_i$ is a set of values $\{a_{i1}, \ldots, a_{ik_i}\}$. An attribute-value literal or
AV-literal is an expression of the form $A_i = a_{ij}$, where $A_i$ is an attribute and
$a_{ij}$ is an associated value. An attribute-value conjunction or AV-conjunction is
a conjunction of AV-literals such that each attribute occurs at most once; the
AV-conjunction is complete if each attribute occurs exactly once.

Attribute-value literals are the building blocks of attribute-value
representations, similar to literals in logic. They are used both in descriptions of individuals,
and in classification rules, as follows. An attribute-value learning task or
AVL-task is characterised by an AV-signature plus an additional class attribute $C$
with associated class values. An example is a complete AV-conjunction together
with a class literal. Classification rules are if-then rules where the if-part or body
is a (possibly incomplete) AV-conjunction, and the then-part or head is a class
literal.¹

^ For simplicity, we do not deal with incomplete descriptions of individuals, nor with 
other connectives than conjunction in the body of rules. Notice, however, that there 
may be several rules with the same class literal in the head, which allows for a limited 
form of disjunction.

The reader will have noticed that AV-conjunctions are used in bodies of
classification rules as well as in examples. This may seem awkward, as they 
have a different semantics: an example refers to a single individual, while the 
body of a classification rule refers to a set of individuals. However, we can bring 
the two on equal footing by noting that complete AV-conjunctions in fact refer 
to equivalence classes of indiscernible objects (the same principle underlying 
rough sets). Consequently, the question ‘does this rule cover this example’ is 
reduced to the question ‘does this AV-conjunction subsume that one’. Using the 
same formalism to represent individuals as well as sets of individuals has been 
termed the Single Representation Trick (SRT). Since the distinction between 
individuals and sets of individuals vanishes under the SRT, there is no need to 
use constants or ground terms to refer to individuals and free variables to refer 
to sets of individuals. Without terms and variables the logic essentially becomes 
propositional. 
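
The reduction of coverage to subsumption can be sketched in a few lines of Python. This is only an illustration under an assumed encoding: an AV-conjunction is a dict from attribute names to values, and the signature, attribute names and helper functions below are hypothetical, not part of the formalism itself.

```python
# Hypothetical AV-signature: attribute name -> set of admissible values.
SIGNATURE = {
    "shape": {"round", "square"},
    "size": {"small", "medium", "large"},
}

def is_av_conjunction(conj, signature=SIGNATURE):
    """A dict mentions each attribute at most once; check that the values are legal."""
    return all(a in signature and v in signature[a] for a, v in conj.items())

def is_complete(conj, signature=SIGNATURE):
    """Complete: every attribute of the signature occurs exactly once."""
    return is_av_conjunction(conj, signature) and set(conj) == set(signature)

def subsumes(general, specific):
    """general subsumes specific if every literal of general also occurs in specific."""
    return all(specific.get(a) == v for a, v in general.items())

def covers(rule_body, description):
    """Under the SRT, 'rule covers example' is just subsumption between AV-conjunctions."""
    return subsumes(rule_body, description)

# An example is a complete AV-conjunction plus a class literal.
example = ({"shape": "round", "size": "medium"}, ("class", "football"))
rule_body = {"shape": "round"}          # a (possibly incomplete) AV-conjunction
```

Here `covers(rule_body, example[0])` holds because the only literal of the body also occurs in the description of the example; the empty body covers every example.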

The above attribute-value formalism is universally accepted in machine learning.
However, from a knowledge representation perspective the formalism may
seem a bit awkward, and in particular the SRT appears to be more an implemen- 
tation detail than anything else. So let’s consider briefly how a logician would 
represent a concept learning task. Clearly, an attribute is a function mapping an 
individual to a value, e.g. length(John) = tall would be a literal occurring in
an example, and length(X) = tall would occur in rules.^ The resulting logic
would be something like first-order logic with unary functions, equality, and no 
predicates. Notice that, if rules are not intended to refer to individuals, we can 
further dispense with naming individuals by specifying the values of attributes 
on the term level rather than the literal level. This will be further elaborated in 
Section 3.3. 

3 How to Generalise Propositional Learning 

First-order concept learning is inductive logic programming restricted to an 
individual-centered representation. In this section we discuss three such repre- 
sentations: individuals represented as Herbrand interpretations, datasets repre- 
sented as an individual-centered database, and individuals represented as terms. 

3.1 From Truthvalue Assignments to First-Order Interpretations

If we assume for the moment that all attributes are boolean, the dataset becomes 
a truthtable, each row of which is a truthvalue assignment representing an individual.
The learning task is then equivalent to synthesising a boolean function
from satisfying and falsifying truthvalue assignments, i.e. models and countermodels
(depending on the class value). Analogously, a first-order dataset can be
defined as a set of first-order interpretations, each interpretation representing a 
distinct instance or counter-instance — for simplicity, we restrict attention to 
Herbrand interpretations. 

^ I adopt the Prolog conventions for constants and variables.

A Herbrand interpretation assigns a truthvalue to each ground literal in the
language, and thus can be tabulated in a (potentially infinite) table. However, 
the switch to first-order logic is not just a matter of increasing dimensionality. As 
indicated before, rules will not contain references to individuals or their parts. 
Consequently, the ground terms used to describe individuals and their parts 
are not shared between individuals, and each Herbrand interpretation can be 
restricted to its own Herbrand universe of ground terms. Not only does this yield 
a huge reduction in the representation of the dataset, it also makes the dataset 
non-homogeneous, as the Herbrand universe of one example can be much larger 
than that of another. It does not make sense anymore to think of the dataset as 
one fixed table. 
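
A minimal sketch of this non-homogeneous representation, assuming a function-free language so that ground terms are just constants. The predicates and constants below (atoms and bonds of two molecules) are invented for illustration only.

```python
# Each individual is a Herbrand interpretation, stored as the set of ground atoms
# that are true for it; every ground atom not listed is false.
# An atom is a tuple (predicate, arg1, arg2, ...). All names are hypothetical.
molecule1 = {("atom", "a1"), ("atom", "a2"), ("bond", "a1", "a2"), ("carbon", "a1")}
molecule2 = {("atom", "b1"), ("carbon", "b1")}

def herbrand_universe(interpretation):
    """Ground terms (here: constants only) used to describe one individual."""
    return {arg for atom in interpretation for arg in atom[1:]}

dataset = [(molecule1, True), (molecule2, False)]   # instance and counter-instance
universe_sizes = [len(herbrand_universe(i)) for i, _ in dataset]
```

The two universes are disjoint and of different size, which is why the dataset can no longer be laid out as one fixed table.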

From this perspective the main innovation of first-order concept learning is 
its ability to represent individuals in a variable format. Notice that by assuming 
that attributes are boolean we lose information: from ‘the colour of x is blue’ 
we can no longer infer that x’s colour isn’t red, unless we apply a special
mechanism such as the Closed World Assumption, or encode the functionality 
of attributes in background knowledge. The main advantage of this perspective 
is that it is closely connected to Computational Learning Theory, where boolean 
function learning from truthvalue assignments is a frequently studied problem. 
This allows one to obtain learnability results for first-order logic [4]. 



3.2 From a Single Table to a Relational Database

An alternative but related perspective is obtained when we think of the dataset 
as a special case of a relational database, namely a single-table database in which 
each individual is represented by a single tuple. The latter assumption can be 
encoded by including a unique identifier for each individual, and requiring that 
this identifier be a key of the relation. From this perspective a first-order dataset 
corresponds to a particular kind of multi-relational database which I will call 
an individual-centered database [1]. In such a database there is one designated
individual relation containing the individual identifier (but not necessarily as 
a key). Every other relation is directly or transitively linked to the individual 
relation via other identifiers.^ 

By virtue of the individual identifier, the database can be partitioned into 
sub-databases each describing a single instance. Such a sub-database can be seen 
as a structured version of the Herbrand interpretations discussed in the previ- 
ous section. Where a Herbrand interpretation is a truthvalue assignment to an 
unstructured set of ground literals, the database perspective has the additional 
advantage that it offers more opportunities for structuring information. Also, 
note that we didn’t need to consider attributes as boolean. On the other hand, 
this perspective restricts us to a language without function symbols. 
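
The partitioning by individual identifier can be sketched as follows. The schema is hypothetical: `molecule` plays the role of the individual relation, `atom` is linked to it directly through the molecule identifier, and `bond` is linked transitively through atom identifiers.

```python
from collections import defaultdict

# Hypothetical individual-centered database: relation name -> list of tuples.
db = {
    "molecule": [("m1",), ("m2",)],
    "atom": [("m1", "a1", "c"), ("m1", "a2", "o"), ("m2", "a3", "c")],
    "bond": [("a1", "a2", "double")],
}

def partition(db):
    """Split the database into one sub-database per individual identifier."""
    owner = {atom_id: mol_id for mol_id, atom_id, _ in db["atom"]}
    parts = defaultdict(lambda: defaultdict(list))
    for (mol_id,) in db["molecule"]:
        parts[mol_id]["molecule"].append((mol_id,))
    for row in db["atom"]:                 # directly linked to the individual
        parts[row[0]]["atom"].append(row)
    for row in db["bond"]:                 # transitively linked via the first atom
        parts[owner[row[0]]]["bond"].append(row)
    return {mol_id: dict(relations) for mol_id, relations in parts.items()}

sub_databases = partition(db)
```

Note that the molecule identifier is not a key of `atom`: individual `m1` owns two tuples there, which is exactly the kind of non-determinacy a single table cannot express.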

^ In addition there would also be relations encoding extensional background knowledge
that is relevant to all individuals. For simplicity we ignore background knowledge in
this paper. Notice that background knowledge can also be encoded intensionally, in
which case the database becomes a deductive rather than a relational database.

From the database perspective we can be a little bit more precise about what
distinguishes first-order concept learning from its attribute-value counterpart. 
There are two conditions under which a multi-relational database is essentially 
equivalent to a single-relational one: 

1. the individual identifier and the other identifiers linking relations must be 
keys; 

2. the links between relations form a directed acyclic graph — in particular, 
there are no recursive links, i.e. a tuple cannot be linked to another tuple in 
the same relation. 

First-order concept learning thus occurs when one or both of these conditions is 
falsified. In this respect, it is interesting to note that non-key individual identifiers
have also been discovered in the attribute-value learning community, where
this single-relation multiple-tuple representation has been baptised a ‘multiple-instance’
representation [5]. Thus, the gap between attribute-value learning and
ILP is being closed from two sides, from the ILP-side by making the represen- 
tation individual-centered, from the AVL-side by making the representation less 
deterministic. 



3.3 From Tuples to First-Order Terms (and beyond)

The third and last perspective discussed in this paper is the one that has been 
developed in Bristol over the last few years [2, 6, 7]. While in the previous two
approaches a row from the attribute-value dataset was interpreted as something 
with a truthvalue (a proposition), in this approach it is viewed as something 
with a denotation (a term). Specifically, an attribute is an enumerated type, 
and an individual is described by an element from the cartesian product of all 
attributes, which is a complex type called a tuple type. 

An example is thus a single literal associating a tuple with a class value. The 
tuple contains all there is to know about the individual concerned, and there is 
no need to introduce an identifier. In order to refer to attribute values in the 
body of a classification rule we need to deconstruct the tuple type by means 
of projection functions, one for each attribute. An attribute-value “literal” in 
a rule, which we call feature in this context, thus consists of two steps: (i) to 
project onto one of the constants in the tuple, and (ii) to test the value of that 
constant. The first step is performed by what we call a structural function, and
the second step concerns a property of the subterm. The difference between the
two is that structural functions apply to non-atomic terms, while properties treat
their argument(s) as atomic.

In order to generalise this to the first-order case we simply allow a more 
complicated type structure, by 

1. allowing other complex types besides tuple types, and 

2. allowing a more deeply nested type structure.

For instance, the type structure could be a set of lists of tuples. Clearly, lists are
first-order terms, and sets are higher-order terms as they represent the characteristic
function of the set at the same time. Each complex type has an associated
operation to decompose a term of that type into subterms. For instance, a list
can be decomposed into head and tail, and a set can be decomposed into its
elements. We thus could have the following feature: $\exists Y, Z$: element(Y,X) $\wedge$
head(tail(Y))=Z $\wedge$ Z=(a,b,c). Here, element decomposes set X into one of its
elements Y, head and tail are used to decompose list Y into its second member
Z, which is then compared to the tuple (a,b,c).
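
As a rough illustration, the feature above can be evaluated on a term of type "set of lists of tuples". The encoding is an assumption made for this sketch: sets are Python frozensets, and lists are Python tuples so that they can be members of a set.

```python
def head(lst):
    """Structural function: first member of a non-empty list."""
    return lst[0]

def tail(lst):
    """Structural function: the list without its first member."""
    return lst[1:]

def feature(x):
    """Exists Y, Z: element(Y, X), head(tail(Y)) = Z, Z = (a, b, c)."""
    for y in x:                      # element(Y, X): one of possibly many members
        if len(y) >= 2:              # head(tail(Y)) needs at least two members
            z = head(tail(y))
            if z == ("a", "b", "c"): # property of the subterm Z
                return True
    return False

# A set containing two lists of tuples (hypothetical individual).
individual = frozenset({
    (("d", "e", "f"), ("a", "b", "c")),
    (("a", "b", "c"),),
})
```

The loop over `x` makes the existential quantification explicit: the feature holds as soon as one member of the set has (a,b,c) as its second list element.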

Notice that such decomposition operations can be non-deterministic (or non- 
determinate, in ILP parlance): it depends on the particular list how often its 
tail can again be decomposed, and it depends on the particular set how many 
elements it has. In fact, set membership is a structural predicate rather than a 
structural function. Likewise, we can use list membership to decompose a list, 
in which case we’re disregarding the order of the list elements but retaining 
their multiplicity (i.e. the list is treated as a bag). This demonstrates that the 
behaviour of a complex type is determined by its associated structural functions or 
predicates. In other words, we don’t need a strongly typed language to implement 
this approach, as long as we recognise the special status of structural predicates 
and functions. 

Paradoxically, we can even dispense with complex terms and use a function- 
free logic. By means of flattening (a very common technique in ILP [8]) we can 
represent function symbols by predicates. This requires introducing identifiers 
for each individual and each of their subterms, which brings the term-inspired 
approach much closer to the previous two. While flattening has a profound effect 
on the representation of the examples, it leaves the representation of rules pretty 
much untouched except for the properties of subterms: whereas in an unflattened
representation we can say ‘subterm equals ground complex term’, in a flattened 
representation we have to decompose the subterm down to the atomic level and 
then equate each of its atomic subterms with constants. 
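
A small sketch of flattening, assuming complex terms are nested Python tuples and everything else is an atomic constant. The predicate name `arg` and the identifier scheme `t0, t1, ...` are hypothetical choices made for this illustration.

```python
import itertools

def flatten(term, facts, counter):
    """Return an identifier for term and append function-free facts describing it."""
    if not isinstance(term, tuple):
        return term                         # atomic constants stand for themselves
    ident = f"t{next(counter)}"             # fresh identifier for this complex subterm
    for position, subterm in enumerate(term):
        sub_id = flatten(subterm, facts, counter)
        facts.append(("arg", ident, position, sub_id))
    return ident

facts = []
root = flatten(("a", ("b", "c")), facts, itertools.count())
```

After the call, `root` names the individual and `facts` holds one `arg` fact per subterm position; the nested tuple ("b", "c") has received its own identifier, so a rule can only test it by descending to its atomic components.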



4 Discussion 

From the foregoing discussion it emerges that the essential feature of first-order 
representations in concept learning is not just that they are structured, as is of- 
ten said, but that they allow a free, non-deterministic format. Each of the three 
representations discussed provides this free format in its own way. In particu- 
lar, the datasets-as-databases approach provides a free-format representation of 
individuals by virtue of the use of non-key identifiers, and cyclic links between 
relations. Analogously, the individuals-as-terms approach provides flexibility by 
means of sets (a higher-order type) and recursive datatypes such as lists. 

The question arises to what extent these three representations are equivalent. 
My conjecture is that at least the databases approach and the terms approach 
are similar enough for any dataset in one representation to be translatable into 
the other. In order to convert a terms representation into a database representation
we essentially apply the flattening transformation described earlier; in the
converse direction we ‘unflatten’ in order to get rid of the identifiers. The trans- 
formation from databases to Herbrand interpretations is also straightforward. 
Finally, we note that Herbrand universes and Herbrand bases can be defined as 
(the union of) algebraic datatypes, over which Herbrand interpretations estab- 
lish a set-type — this takes care of the transformation from interpretations to 
terms. 

On the other hand, the individuals-as-interpretations representation is less 
expressive than the other two, in the sense that it allows less structure to be 
stated explicitly. There is no ‘Herbrand equivalent’ to the datamodel of the 
database or the type signature of the strongly typed language. To some extent 
the mode declarations often used in ILP play this role of specifying the structure 
of the domain, but not to the same extent as a datamodel or type signature 
would. 

A very interesting problem concerns transformations from first-order to propositional
representations. In the case that such a transformation is explicitly applied
to the dataset it is called ‘propositionalisation’. In the term representation,
propositionalisation requires representing e.g. sets over finite universes as
binary vectors encoding the characteristic function. In the database representation,
propositionalisation requires construction of the universal relation. Clearly,
under extreme forms of propositionalisation the size of the dataset explodes (see 
[3] for a preliminary analysis). On the other hand, some approaches do not ex- 
plicitly propositionalise the dataset but use similar transformation arguments 
to construct the hypothesis space. For instance, the 1BC Bayesian classifier [7]
uses the sets-to-vectors transformation to define a probability distribution over 
sets, given a distribution over the universe. There seems to be no consensus as 
to whether such transformations keep the first-order nature of a learning system 
intact. 
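
The sets-to-vectors transformation mentioned above can be sketched as follows, assuming a small finite universe in a fixed order. The universe and function names are hypothetical.

```python
UNIVERSE = ["a", "b", "c", "d"]   # hypothetical finite universe, in a fixed order

def set_to_vector(s, universe=UNIVERSE):
    """Encode a set by its characteristic function: one boolean attribute per element."""
    return [1 if u in s else 0 for u in universe]

def vector_to_set(v, universe=UNIVERSE):
    """Inverse transformation, recovering the set from the binary vector."""
    return {u for u, bit in zip(universe, v) if bit}

row = set_to_vector({"b", "d"})
```

Every set, whatever its size, becomes a row with as many columns as the universe has elements, which gives a first impression of why the dataset can grow quickly under propositionalisation.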
Abstract. Most approaches to inconsistency handling in stratified propositional
belief bases (KB) exhibit bad worst case computational complexity results,
seemingly placing large real-life applications out of reach. One natural way to 
overcome this drawback consists in adopting a trade-off between the solution 
quality and the actual computational resources that are spent. Taking Benferhat 
et al.’s inclusion and lexicographic orderings as a case study, a complete 
revision algorithm is provided in the context of a single kernel assumption. It is 
shown to be a good approximation in the general case. Based on powerful 
heuristics about local search, this technique appears efficient for very large 
KBs, as illustrated by our experimental results. 



1 Introduction 

Handling inconsistent information is the central issue of several correlated domains of 
artificial intelligence, like belief revision and nonmonotonic reasoning. In this paper, 
we focus on the syntax-based revision of inconsistent finite KBs that are equipped 
with a complete pre-ordering between formulas, expressing a preference scale. 
Mainly, the problem is to find maximal consistent sub-bases from KB that 
accommodate both the preference pre-ordering and a maximal cardinality constraint. 
This issue has been thoroughly addressed both from a conceptual point of view (e.g. 
[Benferhat et al. 93, 95]) and from the worst case complexity theory ([Cayrol and 
Lagasquie-Schiex 94] [Nebel 96] [Eiter and Gottlob 92]). Unfortunately, bad worst 
case complexity results seemingly place large real-life applications out of reach. As 
Nebel points out [Nebel 96], several paths could be explored to overcome this
drawback: the use of approximate techniques and the exploitation of the newly 
discovered efficient local search techniques for propositional consistency [Selman et 
al. 92, 93] [Mazure et al. 97a, 98a]. In this paper, these two paths are followed 
simultaneously, with the goal of reaching actual tractability for very large scale KBs 
as often as possible. 

From a technical point of view, Benferhat et al.’s inclusion and lexicographic
orderings are selected as a case study [Benferhat et al. 93, 95]. We apply them to the
incremental construction of a consistent clausal belief base: namely, let f be a
consistent clause and KB be a consistent stratified clausal belief base; how should
$KB' = KB \cup \{f\}$ be revised when it is inconsistent? In this paper, an original algorithm
is proposed to address this issue. On the one hand, it benefits from a non-standard use 
of local search techniques that very often delivers the right formulas to be dropped (or
weakened, see [Bessant et al. 98]) to restore consistency. On the other
hand, we adopt a form of approximation that appears to be a good trade-off between 
the solution quality and the effective computational resources that are spent. More 
precisely, a complete algorithm for revision is provided in the context of a single 
kernel assumption. It is shown that it delivers an acceptable approximation in the 
general case. The algorithm appears very efficient for many very large KBs, as 
illustrated by our experimental results. 

The paper is organized as follows. First, some formal preliminaries are given and 
various principles for maximal consistent sub-bases are recalled, together with the
useful dual concept of minimal inconsistent sub-base. Second, local search techniques 
for propositional consistency are briefly described before it is shown how they are 
(very often) useful for detecting inconsistent information as well. Then the original 
algorithm is proposed. It is studied as a complete algorithm under the single kernel 
assumption before we show that it delivers an acceptable approximation in the general 
case. Finally, some experimental results are given to illustrate the power of the 
approach. 



2 Formal Characterization 



Assume that KB is a consistent finite set of clauses and that f is a consistent clause.
Assume that the m clauses in $KB' = KB \cup \{f\}$ are partitioned into n strata
$S_1 \cup \ldots \cup S_n$. When $c_i \in S_k$ and $c_j \in S_r$ with $k > r$, this means that $c_j$ is preferred
over $c_i$, i.e. if we must choose to drop one of the two clauses then we drop $c_i$.
Accordingly, the strata translate a complete preference pre-ordering on the clauses of
KB'. For simplicity of presentation, we assume that the m clauses are numbered in a
way that follows the strata, i.e. when $c_i \in S_k$, $c_j \in S_r$ and $k > r$, then $i > j$.

Several approaches have been proposed to characterize the preferred maximal 
consistent sub-bases of an inconsistent stratified KB. Here we take Benferhat et al.'s
orderings as a case study [Benferhat et al. 93], which are correlated to previous work
by [Brewka 89]. Namely, let $A = A_1 \cup \ldots \cup A_n$ and $B = B_1 \cup \ldots \cup B_n$ be two consistent
subsets of KB', where $A_i = A \cap S_i$ and $B_i = B \cap S_i$.

Definition 1. [Benferhat et al. 93] 

$A \ll_{(S_1 \cup \ldots \cup S_n,\, inc)} B$ iff $\exists i$ s.t. $A_i \subset B_i$ and for any $j < i$, $A_j = B_j$ ($\subset$ denotes strict
inclusion). A consistent sub-base of KB' that is maximal w.r.t. $\ll_{(S_1 \cup \ldots \cup S_n,\, inc)}$ is called
a maximal inclusion-based consistent sub-base of KB' (w.r.t. $S_1 \cup \ldots \cup S_n$).

An equivalent constructive definition can be established easily. 

Proposition 1. [Benferhat et al. 93] 

$A = A_1 \cup \ldots \cup A_n$ is a maximal inclusion-based consistent sub-base of KB' (w.r.t.
$S_1 \cup \ldots \cup S_n$) iff

$A_1 \cup \ldots \cup A_i$ is a maximal (w.r.t. set inclusion) consistent sub-base of $S_1 \cup \ldots \cup S_i$,
for all $i \in [1..n]$ in increasing order, successively.

Definition 2. [Benferhat et al. 93] 

$A \ll_{(S_1 \cup \ldots \cup S_n,\, lex)} B$ iff $\exists i$ s.t. $|A_i| < |B_i|$ and for any $j < i$, $|A_j| = |B_j|$ (where $|X|$ denotes
the cardinality of the set $X$). A consistent sub-base of KB' that is maximal w.r.t.
$\ll_{(S_1 \cup \ldots \cup S_n,\, lex)}$ is called a maximal lexicographic consistent sub-base of KB' (w.r.t.
$S_1 \cup \ldots \cup S_n$).

Proposition 2. [Benferhat et al. 93] 

$A = A_1 \cup \ldots \cup A_n$ is a maximal lexicographic consistent sub-base of KB' (w.r.t.
$S_1 \cup \ldots \cup S_n$) iff

$A_1 \cup \ldots \cup A_i$ is a maximal (w.r.t. cardinality) consistent sub-base of $S_1 \cup \ldots \cup S_i$, for all
$i \in [1..n]$ in increasing order, successively.
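
To make the two orderings concrete, here is a brute-force sketch in Python. It is not the algorithm of this paper: consistency is tested by enumerating all truth assignments, and the lexicographic sub-base is found by enumerating all subsets, so it is only usable on toy examples. Clauses are assumed to be frozensets of non-zero integers (positive for a variable, negative for its negation); all function names are hypothetical.

```python
from itertools import combinations, product

def consistent(clauses):
    """Brute-force satisfiability test over all truth assignments."""
    variables = sorted({abs(lit) for c in clauses for lit in c})
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(any(model[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return True
    return False

def inclusion_based(strata):
    """One maximal inclusion-based consistent sub-base, built stratum by stratum."""
    kept = []
    for stratum in strata:                     # S_1 (most preferred) first
        for clause in stratum:
            if consistent(kept + [clause]):
                kept.append(clause)
    return kept

def lexicographic(strata):
    """One maximal lexicographic consistent sub-base, by exhaustive enumeration."""
    tagged = [(c, k) for k, stratum in enumerate(strata) for c in stratum]
    best, best_key = [], None
    for size in range(len(tagged) + 1):
        for subset in combinations(tagged, size):
            if consistent([c for c, _ in subset]):
                # cardinality profile (|A_1|, ..., |A_n|), compared lexicographically
                key = tuple(sum(1 for _, k in subset if k == j) for j in range(len(strata)))
                if best_key is None or key > best_key:
                    best, best_key = [c for c, _ in subset], key
    return best

# Toy stratified base: S_1 = {x1}, S_2 = {not x1 or x2}, S_3 = {not x2, x3}.
STRATA = [[frozenset({1})], [frozenset({-1, 2})], [frozenset({-2}), frozenset({3})]]
```

On `STRATA`, both functions keep the clauses of the first two strata and x3, and drop the clause `not x2` from the least preferred stratum.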

For clarity of presentation, we also recall the dual straightforward concept of minimal 
inconsistent kernel. 

Definition 3. 

A minimal inconsistent kernel (in short, kernel) of KB’ is a subset of clauses of KB' 
that is both inconsistent and minimal with respect to set-theoretic inclusion. 

Clearly, KB' might contain several different kernels and their set-theoretic
intersection contains f since f is a clause. Also, dropping one clause belonging to a
kernel of KB' is enough to break the kernel, i.e. to suppress it from the set of kernels
of KB'.

Definition 4. 

The inconsistent part of KB', noted IP(KB'), is the set-theoretic union of all kernels
of KB'.
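
Definitions 3 and 4 can be illustrated by exhaustive enumeration on a toy base. This is a sketch for intuition only (exponential in the number of clauses); clauses are assumed to be frozensets of non-zero integers and the function names are hypothetical.

```python
from itertools import combinations, product

def consistent(clauses):
    """Brute-force satisfiability test over all truth assignments."""
    variables = sorted({abs(lit) for c in clauses for lit in c})
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(any(model[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return True
    return False

def kernels(clauses):
    """All minimal inconsistent subsets, enumerated by increasing size."""
    found = []
    for size in range(1, len(clauses) + 1):
        for subset in combinations(clauses, size):
            candidate = set(subset)
            if any(k <= candidate for k in found):
                continue                      # contains a smaller kernel: not minimal
            if not consistent(subset):
                found.append(candidate)
    return found

def inconsistent_part(clauses):
    """IP: the union of all kernels."""
    return set().union(*kernels(clauses))

# KB is consistent; adding f = x1 creates two kernels that share f.
f = frozenset({1})
KB_PRIME = [f, frozenset({-1, 2}), frozenset({-2}),
            frozenset({-1, 3}), frozenset({-3}), frozenset({4})]
```

On `KB_PRIME` there are two kernels, both containing f, and the clause x4 lies outside the inconsistent part.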

Let us now present the basic principles about local search that we shall use. 



3 Local Search and Inconsistency Detecting 

Most local search algorithms (mainly, GSAT and its variants [Selman et al. 92, 93, 
97] [McAllester et al. 97] [Mazure et al. 97a, 98a]) perform a greedy local search for a 
satisfying assignment of a set of propositional clauses. The algorithms generally start 
with a randomly generated truth assignment of the propositional variables (i.e. an assignment of
true or false to each propositional variable). Most of them change (‘flip’) the
assignment of the variable that leads to the largest increase in the total number of 
satisfied clauses. Such flips are repeated until either a model is found or a preset 
maximum number of flips is reached. This process is repeated as needed up to a 
preset maximum number of times. To escape from local extrema, local search algorithms are
provided with additional (e.g. random) moves. In the following, we use a variant procedure
called TSAT, which makes use of a tabu list forbidding recurrent flips and appears
competitive in most situations [Mazure et al. 97a, 98b].

Although local search techniques prove efficient in showing the consistency of hard 
propositional clausal KBs, they are clearly logically incomplete in the sense that they
do not cover the whole search space of interpretations and cannot thus directly prove 
that a formula is inconsistent. Recently, it has been discovered that the trace of local 
search algorithms can allow one to determine the smallest inconsistent kernels very 
often when they fail to prove consistency within a preset amount of computing time 
[Mazure et al. 97c, 98a]. More precisely, for each clause, taking each flip as a step of
time, the number of times during which this clause is falsified is updated. A similar 
trace is also recorded for each literal occurring in KB, counting the number of times it 
has appeared in the falsified clauses. When the local search technique fails to prove 
consistency, it appears extremely often that the most often falsified clauses form a 
superset of the smallest inconsistent kernels of KB if KB is actually inconsistent. 
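
The idea of recording a trace during local search can be sketched with a plain GSAT-style loop. This is not TSAT (there is no tabu list) and only the per-clause counter is kept, not the per-literal one; the function name, parameters and defaults are assumptions made for this illustration. Clauses are sets of non-zero integers.

```python
import random

def gsat_with_trace(clauses, max_tries=10, max_flips=100, seed=0):
    """Greedy local search; returns (model or None, per-clause falsification counts)."""
    rng = random.Random(seed)
    variables = sorted({abs(lit) for c in clauses for lit in c})
    falsified = [0] * len(clauses)

    def is_sat(clause, model):
        return any(model[abs(lit)] == (lit > 0) for lit in clause)

    def score_after_flip(var, model):
        model[var] = not model[var]
        score = sum(is_sat(c, model) for c in clauses)
        model[var] = not model[var]            # undo the tentative flip
        return score

    for _ in range(max_tries):
        model = {v: rng.random() < 0.5 for v in variables}
        for _ in range(max_flips):
            unsat = [i for i, c in enumerate(clauses) if not is_sat(c, model)]
            if not unsat:
                return model, falsified
            for i in unsat:                    # trace: one time step per flip
                falsified[i] += 1
            scores = {v: score_after_flip(v, model) for v in variables}
            best = max(scores.values())
            var = rng.choice([v for v in variables if scores[v] == best])
            model[var] = not model[var]
    return None, falsified

# x1, (not x1 or x2), (not x2) form a kernel; x3 is unrelated.
UNSAT = [frozenset({1}), frozenset({-1, 2}), frozenset({-2}), frozenset({3})]
model, counts = gsat_with_trace(UNSAT)
```

On `UNSAT` the search necessarily fails, and the three kernel clauses accumulate far more falsifications than the unrelated clause x3, which is satisfied after at most a couple of flips in each try.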

Let us stress that this experimental result appears helpful in the problem of proving 
inconsistency only when the discovered probable kernels remain of a size that is
actually tractable for the best complete satisfiability checking techniques. Fortunately, 
in large actual KBs, inconsistency is often due to a limited number of contradictory 
rules and facts. In this case, a call to a local search algorithm followed by a complete 
Davis and Putnam like procedure (in short, DP) [Davis et al. 62] that exploits the trace 
of the failed local search proves efficient very often. Obviously enough, unless P = 
NP, worst cases can always be imagined that make this approach for consistency and 
inconsistency checking exponential in the size of KB'. 

Local search techniques for consistency checking were first designed to address 
the MAXSAT problem, i.e. the optimization problem that consists in finding maximal 
set-inclusion consistent subsets of KB’ [Hansen and Jaumard 1990]. Thus, according 
to Propositions 1 & 2, a natural technique to compute maximal consistent sub-bases 
of KB' consists in applying advanced efficient local search techniques for MAXSAT
(e.g. [Battiti and Protasi 96]) in an iterative way w.r.t. the successive strata (and,
when the number of strata is large, even including a dichotomic flavor in a search for
the largest $i$ in $[1..n]$ s.t. $S_1 \cup \ldots \cup S_i$ is consistent). One should also note that the
smallest number of falsified clauses encountered during the failed search for a model 
by a local search algorithm is an upper bound of the minimal number of clauses that 
must be dropped to restore consistency (not taking into account the preference 
priority). 

As the number of strata can be huge, we explore another path here. The main idea 
is to use the trace of a failed local search to find the clauses that (most probably) 
would restore consistency if they were dropped, taking into account the preference 
pre-ordering. But before we describe the algorithm, we distinguish between two 
different possible situations. 



4 Under the Single Kernel Assumption 

In the incremental approach to build a consistent belief base, we can hope that the 
introduction of a new clause f will not conflict in several minimal ways with the
clauses already in KB. Accordingly, we distinguish between two possible situations
when $KB' = KB \cup \{f\}$ is inconsistent. In the first one, which we call the single
kernel assumption, there is exactly one kernel in KB' whereas, in the general case,




172 Eric Gregoire 



there can be several different kernels in KB’. To some extent the single kernel 
assumption is correlated to the single diagnosis situation in consistency-based 
diagnosis [Reiter 87]. Clearly, the single kernel situation will be encountered more 
often in our incremental building of a consistent stratified belief base than in the 
process of building the belief base globally before restoring consistency only once. 

Accordingly, let us first consider that the single kernel assumption applies. In the 
next section, we shall study how the following results must be interpreted otherwise. 

Definition 5. 

The set of highest cancelable clauses of KB' is the set of clauses that are
maximal according to the complete pre-ordering < between clauses in IP(KB').

Since clauses are classified according to a complete pre-ordering, this set contains
at least one clause. It is obvious that it does not contain any clause c s.t. c < f.
Under the single kernel assumption, dropping one element of this set is enough to get
a maximal consistent inclusion-based or lexicographic consistent sub-base of KB';
the two notions necessarily coincide in this case. Finding one element of this set is
however computationally heavy in the
worst case; it is polynomial given a number of calls to an NP-oracle that is
logarithmic with respect to the number of strata. More detailed worst case results can
be found in [Cayrol and Lagasquie-Schiex 94].

Proposition 3. 

Finding one element of the set of highest cancelable clauses is in $\mathrm{FP}^{\mathrm{NP}[O(\log n)]}$, where n is the total number of strata.

The following original procedure allows us to find a maximal consistent inclusion-based
or lexicographic consistent sub-base of KB'.

Interestingly enough, this is done very efficiently for most belief bases, very often
reducing the logarithmic number of calls to an NP-oracle to only 3 calls to a fast
satisfiability check.

Procedure 1. 

1. A local search procedure is run on KB'. If it fails to prove the consistency of KB'
after a preset computing time, a complete DP-based search is performed,
focusing on the trace of the failed local search. When KB' is inconsistent, the
trace of the search delivers a list L of clauses c that are such that $c \geq f$ and that
are sorted in decreasing order w.r.t. the number of times they have been falsified
during the search.

2. L is re-organized in such a way that a clause that belongs to the most often
falsified ones and that exhibits the highest stratum among these latter ones
appears first.

3. Under the single kernel assumption, the first clause c in the list L most probably
belongs to the set of highest cancelable clauses. Accordingly, we check the
consistency of $KB' \setminus \{c\}$ and retract c from L.

(a) If $KB' \setminus \{c\}$ is consistent, we have to make sure that a similar result cannot
be obtained using a clause c' s.t. c' > c. To this end, we run a complete
satisfiability check on $(KB' \setminus \{c'' \mid c'' > c,\ c'' \in L\}) \cup \{c'' \vee mark_{c''} \mid c'' > c,\ c'' \in L\}$,
where the $mark_{c''}$ are new literals, each of them being
associated with a given c''.

- When no model is found, we have proved that $KB' \setminus \{c\}$ is one
maximal inclusion-based consistent sub-base of KB'.
- When a model is found, the heuristic was wrong (low probability):
let c' be the smallest clause w.r.t. < s.t. $mark_{c'}$ is true in this
model. Then all clauses c'' that are such that c'' < c' are cancelled
from L. We restart Step 2.

(b) If $KB' \setminus \{c\}$ is inconsistent (no model exists), this means that our heuristic
was wrong. We go to Step 2.
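
The overall shape of the procedure can be sketched in a deliberately simplified form. The sketch below departs from Procedure 1 in several ways, all of which are assumptions of this illustration: consistency is tested by brute force rather than by local search followed by DP, candidates are ordered by stratum first and falsification count second, and the marker-literal check of Step 3(a) is replaced by simply trying candidates from the highest stratum downwards. All names are hypothetical.

```python
from itertools import product

def consistent(clauses):
    """Brute-force satisfiability test over all truth assignments."""
    variables = sorted({abs(lit) for c in clauses for lit in c})
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(any(model[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return True
    return False

def clause_to_drop(kb_prime, stratum_of, falsified_count, f):
    """Return a clause whose removal restores consistency, or None if there is none."""
    # Candidates: clauses that are not preferred over f (stratum at least that of f).
    candidates = [c for c in kb_prime if stratum_of[c] >= stratum_of[f]]
    # Highest stratum first; within a stratum, most often falsified first.
    candidates.sort(key=lambda c: (-stratum_of[c], -falsified_count[c]))
    for c in candidates:
        if consistent([d for d in kb_prime if d != c]):
            return c
    return None

f = frozenset({1})
c2, c3, c4 = frozenset({-1, 2}), frozenset({-2}), frozenset({3})
KB_PRIME = [f, c2, c3, c4]
STRATUM = {f: 1, c2: 2, c3: 3, c4: 3}
COUNTS = {f: 9, c2: 8, c3: 7, c4: 1}      # hypothetical trace of a failed local search
```

With a single kernel {f, c2, c3}, the first candidate whose removal restores consistency is c3, the kernel clause in the highest stratum; the falsification counts only decide which clause of a stratum is tried first.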

Proposition 4. 

Under the single kernel assumption, Procedure 1 allows us to find a maximal
consistent inclusion-based or lexicographic consistent sub-base of KB'.

Let us stress that this procedure will require all elements from L to be considered in 
the worst case. Accordingly, it is not the most efficient one for those worst cases;
however, we believe that this price is acceptable in view of the very good
experimental efficiency on most problems.

In some applications, we need to find all maximal consistent inclusion-based or
lexicographic consistent sub-bases. Obviously enough, any other such sub-base can
only be obtained by retracting another (different) clause from the set of highest
cancelable clauses. In this respect,
we can check the consistency of $KB' \setminus \{c'\}$ for all other clauses c' from the stratum to
which all elements of this set belong. We can often reduce this additional number of
calls to consistency checks by means of a dichotomic strategy. The
clause made of the disjunction of all clauses in this stratum is split in a
dichotomic way into clauses c', according to which the consistency of $KB' \setminus \{c'\}$ is then
checked. Whenever an inconsistency is proved w.r.t. such a clause c', we know that
c' does not contain any clause from the set of highest cancelable clauses and thus
that c' does not need further
splitting. Moreover, this strategy can be mixed with the heuristic result asserting that
clauses are classified in L according to their decreasing experimental probability of
belonging to the set of highest cancelable clauses.



5 The General Case 

In the general case, we cannot assume the existence of a unique kernel when KB’ is 
inconsistent. All we know is that f will belong to the intersection of the kernels, but
this information is often useless since f is often put in the lowest stratum and
preserved when consistency is restored. 

The nice heuristic in the trace of the failed local search for a model of KB’ in the
previous section allowed us to obtain a list L (after Step 2 of Procedure 1) whose first
element, a clause c, was most probably one of the clauses in IP(KB’) that exhibits the
lowest preference. The existence of multiple kernels does not change the
experimental value of this heuristic. Assume that KB” = KB’, initially. Accordingly,
we might decide to switch our policy with respect to Step 3 of Procedure 1. The idea
is to insert c in a list called Dropped in any case. Let KB” = KB” \ {c}. If we reach
3(a) then we stop (although we could roughly import the previous contents of 3(a)
here). If we reach 3(b) then we iterate the whole process w.r.t. KB”. Clearly, such a
procedure will provide us with a list Dropped that would ensure that consistency is
restored when all its elements are dropped from KB’. For clarity of presentation, let
us give a more general procedure, noting that it can also be easily adapted to run when
Step 3(b) of Procedure 1 is reached.

Procedure 2. 

1. Let KB” = KB’ and Dropped = ∅, initially.

2. A greedy local search procedure is run on KB”. When it fails to prove the 
consistency of KB " after a preset computing time, a complete DP-based search is 
performed, focusing on the trace of the failed local search. When KB” is 
inconsistent, the trace of the search delivers a list L of clauses c that are such that 
c a /and that are sorted in decreasing order w.r.t. to the number of times they 
have been falsified during the search. 

3. L is re-organized in such a way that a clause that belongs to the most often 
falsified ones and that exhibits the highest stratum among these latter ones 
appears first. 

4. The first clause c in the list L is the clause of IP(KB”) with the lowest preference,
probably. c is put in Dropped and KB” = KB” \ {c}. If KB” is consistent then the
procedure stops, else we go to Step 2.

Proposition 5. 

In the worst case, Dropped consists of m - 1 clauses and its computation belongs to
$FP^{NP[O(m)]}$, where m is the total number of clauses in KB’.

However, our experiments show that Procedure 2 often requires a number of calls
to a satisfiability check that is close to the number of kernels of IP(KB’), which is
often a very small number in most realistic situations of incremental construction of
belief bases. When the heuristic about local search works nicely, Procedure 2 delivers
us clauses of an increasing preference, each of them allowing at least one kernel to be
broken if the clause is dropped. However, a well-known problem is that dropping
clauses starting from the least preferred ones can make us lose too much information.
For instance, let us assume the existence of two kernels {c1, c17} and {c17, c34}. If we
start breaking the second kernel by dropping c34 because it contains a less preferred
clause, then we shall also drop c17 to break the first kernel. However, dropping c17
was enough to break the second kernel. Moreover, the heuristic can be wrong in the
sense that clauses that do not belong to IP(KB’) can have been inserted in Dropped.

In order to recover from these two possible drawbacks, an additional step in 
Procedure 2 is necessary. Its goal is to filter the elements of Dropped in order to 
converge towards a set of clauses that an inclusion-based ordering would require us to 
drop. All clauses c from Dropped are considered successively, according to their 
decreasing preference. A clause c is kept in a final set Dropped’ only when c belongs 
to a kernel, taking into account the previously preserved clauses from Dropped. More 
precisely. 

Filtering Step 

First, Dropped is sorted w.r.t. the decreasing preference pre-ordering of its clauses,
i.e. we take the clauses that are minimal w.r.t. < in the first place. Let KB” = KB’. Let
Dropped’ = ∅. Each clause c of Dropped is then considered, successively. The
consistency of (KB” \ Dropped) ∪ {c} is then checked. If it is consistent then KB”
= KB” ∪ {c} else Dropped’ = Dropped’ ∪ {c}.
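
The Filtering Step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the brute-force satisfiability test stands in for the TSAT/DP-based checks, clauses are written as tuples of DIMACS-style integer literals, and the names `is_consistent` and `filtering_step` are hypothetical. The sketch also assumes one reading of the step, namely that the working base starts as KB’ without the clauses of Dropped and that every reinserted clause is retained for the later checks.

```python
from itertools import product


def is_consistent(clauses):
    """Brute-force satisfiability test for a small CNF (assumption: toy sizes only)."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        # A clause is satisfied when at least one of its literals is true.
        if all(any(model[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses):
            return True
    return False


def filtering_step(kb, dropped):
    """Return (retained base, Dropped'); `dropped` must already be sorted
    in the order in which the Filtering Step considers its clauses."""
    kept = [c for c in kb if c not in dropped]
    final_dropped = []
    for c in dropped:
        if is_consistent(kept + [c]):
            kept.append(c)           # c can be reinserted safely
        else:
            final_dropped.append(c)  # c really has to be dropped
    return kept, final_dropped


# Toy base: x, (not x or y), not y, z. The clause z was dropped by mistake.
kb = [(1,), (-1, 2), (-2,), (3,)]
kept, final_dropped = filtering_step(kb, [(3,), (-2,)])
```

In this toy run the wrongly dropped clause `(3,)` is reinserted, and only `(-2,)` remains in the final set.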

Clearly, Dropped’ is an approximation of the clauses that are dropped to obtain a 
maximal inclusion-based consistent sub-base of KB’. It is an approximation in the 
sense that it is not guaranteed that the lowest priority piece of information has been 
dropped for each kernel. 

Proposition 6. 

When the trace of the local search allows us to extract a superset of IP(KB ’) as the 
part of L that is interpreted as containing the most often falsified clauses, then the 
combination of Procedure 2 and the Filtering Step computes a maximal inclusion- 
based consistent sub-base of KB ’. 

Fortunately, the heuristic about local search proves often accurate and since we sort 
the first clauses in the list L according to their increasing preference, a right set of 
clauses to be dropped is found very often. This is confirmed by our experimental 
results. 
6 Experimental Results

In the following, the local search procedure TSAT is used [Mazure et al. 97a, 98b],
with its weight option and a tabu list set to its general-purpose recommended length
18. All tests were conducted on a Pentium 133, under Linux. Due to lack of space,
we just relate here an example about the single kernel assumption. The example is
built in an artificial way to illustrate the involved concepts, but its computing time
results are typical, even for more elaborate or natural problems. More extensive
experimental results are presented in the full paper.

We considered the (difficult) SSA7552-038 VLSI-related benchmark [DIMACS 93]
consisting of 3575 clauses and 1501 variables. This instance was shown consistent in
less than 43 sec. using TSAT. Then, we built a simple inconsistent instance (called
“insat”) made of 4 clauses, using variables from the SSA problem, and added 3
clauses from “insat” in a random way inside the SSA problem to get KB. Consistency
of KB was then shown in less than 1.28 sec., using TSAT. We introduced the fourth
clause from “insat” in the same way to yield an augmented SSA instance KB’ that is
now inconsistent.

After a 3-minute search by TSAT on KB’, the trace in Figure 1 was obtained. Clause
Cj,jj exhibits a very high score and is the less preferable one among the most often 
falsified clauses. Assuming that a complete ordering is given between the 3579 
clauses, we extract from the instance and show that consistency is restored, using 
TSAT in less than 0.73 sec. Showing that consistency cannot be restored with a less 
preferred clause is then achieved in less than 0.01 sec. Accordingly, SSA7552-038 U 
“insat” \ {Cj,^} is thus the unique maximal lexicographic (or inclusion-based) 
maximal consistent sub-base of SSA7552-038 U “insat”, with respect to the 
considered complete ordering. 



7 Conclusions 

Inconsistency handling has mostly been considered from a conceptual point of view
only. Worst-case complexity shows that most approaches can actually be intractable
since they can exhibit exponential computing time (unless P = NP). Few actual
techniques allow large belief bases to be considered. To address the implementation
issue for realistic and large belief bases, as Nebel points out [Nebel 96], we might have
to consider approximation techniques. We also have to adapt the most efficient
consistency checking techniques and find appropriate powerful heuristics. This paper
is a first step in these complementary directions, which is promising in view of our
experimental results. Actually, we are not aware of any other approximation or exact
technique that can handle belief bases as large as those our technique permits. In the
near future, we hope to extend this approach to compute all the lexicographic and
inclusion-based maximal consistent sub-bases in the general case. Also, we believe
that the approach could prove useful in computing nonmonotonic inferences based on
maximal consistent sub-bases. Very promising results have already been obtained in
this direction [Gregoire and Sais 99].
Abstract. This work has two objectives: first, to introduce rough set
theory, developed by Pawlak, to a wider audience; second, to present
computational methods for the theory, allowing it to be implemented
in many more systems. Rough set theory is a new mathematical tool
to deal with vagueness and uncertainty. This approach seems to be of
fundamental importance to artificial intelligence and cognitive sciences.
Although the burgeoning methodology has been successful in many real-life
applications, there are still several theoretical problems to be solved.
We need a practical approach to apply the theory. Some problems, for
example, the general problem of finding all reducts, are NP-hard. Thus,
it is important to investigate computational methods for the theory. We
present computational methods for the theory of rough sets and knowledge
discovery in databases. Emphasizing applications, we illustrate our
methods by means of running examples using data of flu diagnosis.



Introduction 

In this paper we present computational methods for the theory of rough sets
and knowledge discovery in databases. Let $U$ be a universe and $\Theta$ the set of
equivalence relations on $U$. In section 1, algorithm P dynamically giving
classification for an equivalence relation with the time complexity $O(|U|)$ is given.
The algorithm can be run in parallel mode to find concurrently all corresponding
classifications for many equivalence relations. In section 2, we introduce rough
subsets and support subsets. Algorithm L giving the lower approximation with
the time complexity $O(|U|)$ is given. The time complexity for computing a
support subset from an equivalence relation is $O(|U|^2)$. In section 3, algorithm I for
computing an intersection of two classifications with time complexity $O(|U|^2)$
is given. By algorithm I, the time complexity for computing a support subset
from many equivalence relations is $O(|\Theta||U|^2)$. Section 4 discusses functional
and identity dependencies. The time complexity to check the dependency is
$O(|\Theta||U|^2)$. Section 5 introduces the significance of an equivalence relation $\theta$.

A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 179-189, 1999. 
The time complexity for computing a significance is $O(|\Theta| \times |U|^2)$. Algorithm C
with the time complexity $O(|\Theta|^2|U|^2)$ is presented to allow us to find the core,
that is, the significant subset of $\Theta$. Section 6 discusses reducts, that is, the
minimal identity dependent subsets of $\Theta$. We present Algorithm A with the time
complexity $O(|\Theta|^2|U|^2)$ to find one reduct. The algorithm can be run in parallel
mode to find concurrently all reducts. Algorithm H with the time complexity
$O(|\Theta|^2 \times |U|^2)$ and algorithm H' with the time complexity $O(|\Theta|^2 \times |U|^2)$ to find
at most one key are presented in section 7. In section 8, algorithm D to discover
knowledge with the time complexity $O(|\Theta|^2|U|^2)$ is given.

1 Dynamic classifications 

Let $U$ be a universe. For every equivalence relation $\theta$ on $U$, we have a
corresponding partition $U/\theta$ in universe $U$ as follows: two objects $u, v \in U$ are in the
same class if and only if $u \theta v$.

Algorithm P. Let $U = \{u_1, u_2, \ldots, u_i, \ldots, u_{|U|}\}$. This algorithm dynamically
gives classification $U/\theta$. We use the following pointers: $i$ points to the current
input object $u_i$; $s$ records that we have found $s$ classes $V_1, V_2, \ldots, V_s$; $j$ ranges
over $1, 2, \ldots, s$ to check whether or not $V_j \theta u_i$ for the current input object $u_i$. If
$u_i \theta V_j$ for some $j$ then put $u_i$ in $V_j$: $V_j \ni u_i$. Otherwise, establish a new class:
$s \leftarrow s + 1$, $V_s = \{u_i\}$.

When the algorithm is completed, we have $U/\theta = \{V_1, V_2, \ldots, V_s\}$.

P1. [Initialize] Set $i \leftarrow 1$, $j \leftarrow 1$, $s \leftarrow 1$, $V_1 = \{u_1\}$.

P2. [Is $u_i \theta V_j$?] If $u_i \theta V_j$ then $V_j \ni u_i$. Otherwise, go to P5 to check the next
$V_j$ if any.

P3. [Is $i = |U|$?] If $i = |U|$ then the classification is completed, and we
have $U/\theta = \{V_1, V_2, \ldots, V_s\}$. If $i < |U|$ then go to P4.

P4. [Increase $i$] $i \leftarrow i + 1$, go to P2.

P5. [Is $j = s$?] If $j = s$ then establish a new class $s \leftarrow s + 1$, $V_s = \{u_i\}$ and
go to P3 to input the next object if any. If $j < s$ then go to P6.

P6. [Increase $j$] $j \leftarrow j + 1$, go to P2. ◊

The algorithm can be run in parallel mode to find concurrently all corres- 
ponding classifications for many equivalence relations. 
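
The control flow of Algorithm P can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify` and the callback `related` (a predicate deciding whether two objects are equivalent) are hypothetical, and each new object is compared with one representative of every class found so far, restarting from the first class for each object.

```python
def classify(universe, related):
    """Build the partition of `universe` induced by the equivalence test `related`."""
    classes = []                        # the classes V_1, ..., V_s found so far
    for u in universe:                  # i ranges over the input objects
        for cls in classes:             # j ranges over the existing classes
            if related(u, cls[0]):      # compare with a class representative
                cls.append(u)
                break
        else:
            classes.append([u])         # no class matched: open a new one
    return classes


# Toy relation (an assumption for the demo): equality modulo 3.
by_residue = classify(range(1, 7), lambda a, b: a % 3 == b % 3)
```

Because the relation is assumed to be an equivalence, testing against a single representative per class is enough.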

Example. The following is a decision table (Pawlak et al 1995). 



U\A | θ1  θ2  θ3  θ
u1  | y   y   0   n
u2  | y   y   1   y
u3  | y   y   2   y
u4  | n   y   0   n
u5  | n   n   1   n
u6  | n   y   2   y


where $U = \{u_1, u_2, u_3, u_4, u_5, u_6\}$ is the universe, $A = \{\theta_1, \theta_2, \theta_3, \theta\}$ is the set
of attributes: $\theta_1$ - headache, $\theta_2$ - muscle pain, $\theta_3$ - temperature, $\theta$ - decision flu.
And for short, y - yes, n - no; 0 - normal, 1 - high, 2 - very high.




Rough Knowledge Discovery and Applications 181 



For every attribute $\tau$, we have a corresponding partition $U/\tau$ such that two
objects $u_i, u_j \in U$ are in the same class if and only if $\tau(u_i) = \tau(u_j)$, where $\tau(u_i)$
is the entry on row $u_i$ and column $\tau$ in the table, and $\tau(u_j)$ is the entry on row
$u_j$ and column $\tau$ in the table.

For our example, algorithm P will find

$U/\theta_1 = \{\{u_1, u_2, u_3\}, \{u_4, u_5, u_6\}\}$;

$U/\theta_2 = \{\{u_1, u_2, u_3, u_4, u_6\}, \{u_5\}\}$;

$U/\theta_3 = \{\{u_1, u_4\}, \{u_2, u_5\}, \{u_3, u_6\}\}$;

$U/\theta = \{W_1, W_2\}$, $W_1 = \{u_1, u_4, u_5\}$, $W_2 = \{u_2, u_3, u_6\}$.
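
These partitions can be checked mechanically. The sketch below encodes the flu decision table of the running example and groups objects by attribute value. It is an illustration, not Algorithm P itself: the dictionary-based grouping, the name `partition`, and the keys `t1`, `t2`, `t3`, `d` (standing for the three condition attributes and the decision) are assumptions made for the example.

```python
# Flu decision table of the running example (t1 headache, t2 muscle pain,
# t3 temperature, d decision flu); key names are illustrative.
TABLE = {
    "u1": {"t1": "y", "t2": "y", "t3": 0, "d": "n"},
    "u2": {"t1": "y", "t2": "y", "t3": 1, "d": "y"},
    "u3": {"t1": "y", "t2": "y", "t3": 2, "d": "y"},
    "u4": {"t1": "n", "t2": "y", "t3": 0, "d": "n"},
    "u5": {"t1": "n", "t2": "n", "t3": 1, "d": "n"},
    "u6": {"t1": "n", "t2": "y", "t3": 2, "d": "y"},
}


def partition(table, attrs):
    """Group objects that agree on every attribute in `attrs`."""
    classes = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, []).append(obj)
    return list(classes.values())


headache = partition(TABLE, ["t1"])
muscle_pain = partition(TABLE, ["t2"])
temperature = partition(TABLE, ["t3"])
decision = partition(TABLE, ["d"])
```

Grouping by value relies on hashing rather than on pairwise comparisons, so it is a different route to the same classes.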



2 Support Subsets 



Let $W$ be a subset of $U$ and $\theta$ an equivalence relation on $U$. Subset
$S_\theta(W) = \bigcup_{V \in U/\theta,\, V \subseteq W} V$ is called the lower approximation or support subset to $W$ from $\theta$,
and $spt_\theta(W) = |S_\theta(W)|/|U|$ is said to be the support degree to $W$ from partition
$U/\theta$.

Algorithm L. Let $U$ be a universe and $\theta$ an equivalence relation. Let
$U/\theta = \{V_1, V_2, \ldots, V_s\}$, $1 \leq s \leq |U|$. Let $W \subseteq U$. This algorithm gives the lower
approximation $W^\theta = \bigcup_{V = V_1, V_2, \ldots, V_j, \ldots, V_s;\, V \subseteq W} V$ to $W$ from $U/\theta$.

L1. [Initialize] Set $j \leftarrow 1$, $L \leftarrow \emptyset$.

L2. [Is $V_j \subseteq W$?] If $V_j \subseteq W$, then $L \leftarrow L \cup V_j$. Otherwise, go to L3 to check
the next $V_j$ (if any).

L3. [Is $j = s$?] If $j = s$ then the algorithm is completed, and we have
$W^\theta = L$. Otherwise, go to L4 to check the next $V_j$.

L4. [Increase $j$] Set $j \leftarrow j + 1$. Go to L2. ◊

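For our example, algorithm L will find

$spt_{\theta_1}(W_1) = |S_{\theta_1}(W_1)|/|U| = |\emptyset|/|U| = 0$;

$spt_{\theta_1}(W_2) = |S_{\theta_1}(W_2)|/|U| = |\emptyset|/|U| = 0$;

$spt_{\theta_2}(W_1) = |S_{\theta_2}(W_1)|/|U| = |\{u_5\}|/|U| = 1/6$;

$spt_{\theta_2}(W_2) = |S_{\theta_2}(W_2)|/|U| = |\emptyset|/|U| = 0$;

$spt_{\theta_3}(W_1) = |S_{\theta_3}(W_1)|/|U| = |\{u_1, u_4\}|/|U| = 2/6$;

$spt_{\theta_3}(W_2) = |S_{\theta_3}(W_2)|/|U| = |\{u_3, u_6\}|/|U| = 2/6$.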
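
A short Python sketch of the lower approximation and the support degree follows. The function names `lower_approximation` and `support_degree` are hypothetical, the partitions of the running example are written out by hand, and exact fractions are used so that 2/6 is reported in lowest terms as 1/3.

```python
from fractions import Fraction


def lower_approximation(classes, w):
    """Union of the classes that are entirely contained in the subset w."""
    w = set(w)
    result = set()
    for v in classes:
        if set(v) <= w:
            result |= set(v)
    return result


def support_degree(classes, w, universe_size):
    """Fraction of the universe covered by the lower approximation of w."""
    return Fraction(len(lower_approximation(classes, w)), universe_size)


# Partitions and decision classes of the flu example, written out by hand.
U_T1 = [["u1", "u2", "u3"], ["u4", "u5", "u6"]]
U_T2 = [["u1", "u2", "u3", "u4", "u6"], ["u5"]]
U_T3 = [["u1", "u4"], ["u2", "u5"], ["u3", "u6"]]
W1 = ["u1", "u4", "u5"]
W2 = ["u2", "u3", "u6"]

degrees = {
    (name, label): support_degree(classes, w, 6)
    for name, classes in [("t1", U_T1), ("t2", U_T2), ("t3", U_T3)]
    for label, w in [("W1", W1), ("W2", W2)]
}
```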






Let $U$ be a universe and $\theta$, $\tau$ two equivalence relations on $U$. The support
subset of $\tau$ from $\theta$ is subset $S_\theta(\tau) = \bigcup_{W \in U/\tau} S_\theta(W)$, and $spt_\theta(\tau) = |S_\theta(\tau)|/|U|$
is called the support degree to $\tau$ from $\theta$.

If $U/\tau = U/\delta$, where $\delta$ is the “universal” partition $U/\delta = \{U\}$, then we have
$S_\theta(\tau) = U$, $spt_\theta(\tau) = 1$ for all equivalence relations $\theta$.

For our example, we have $spt_{\theta_1}(\theta) = 0$, $spt_{\theta_2}(\theta) = 1/6$, $spt_{\theta_3}(\theta) = 4/6$.
3 Supports from Many Partitions

Let $U$ be a universe and $(\Theta, \cap)$ the semi-group of equivalence relations. For a
set $\Omega \subseteq \Theta$ of equivalence relations we define classification $U/\Omega$ in universe $U$ as
follows: two objects $u, v \in U$ are in the same class if and only if $u \theta v$ for every
$\theta \in \Omega$, i.e., $u (\bigcap_{\theta \in \Omega} \theta) v$. We also define $U/\emptyset = U/\delta$, i.e., $\bigcap_{\theta \in \emptyset} \theta = \delta$, where $\delta$ is
the “universal partition” $U/\delta = \{U\}$, $|U/\delta| = 1$.

Algorithm I. Let $U$ be a universe and $\theta_1, \theta_2$ be two equivalences.

Let $U/\theta_1 = \{V_{11}, V_{12}, \ldots, V_{1i}, \ldots, V_{1s}\}$, $1 \leq s \leq |U|$;

$U/\theta_2 = \{V_{21}, V_{22}, \ldots, V_{2j}, \ldots, V_{2t}\}$, $1 \leq t \leq |U|$ for $\theta_1, \theta_2 \in \Theta$.

This algorithm gives classification

$U/\theta_1\theta_2 = \{V_1, V_2, \ldots, V_k, \ldots, V_r\}$, $1 \leq r \leq st$.

We use the following pointers: $i = 1, 2, \ldots, s$ points to $V_{1i}$; $j = 1, 2, \ldots, t$ points
to $V_{2j}$; $r$ records that we have found $r$ classes $V_1, V_2, \ldots, V_r$ of $U/\theta_1\theta_2$.

For every $i$ and every $j$, we check whether or not $V_{1i} \cap V_{2j} = \emptyset$. If $V_{1i} \cap V_{2j} = \emptyset$
then simply ignore it. Otherwise, establish a new class: $r \leftarrow r + 1$, $V_r = V_{1i} \cap V_{2j}$.

I1. [Initialize] Set $i \leftarrow 1$, $j \leftarrow 1$, $r \leftarrow 0$.

I2. [Is $V_{1i} \cap V_{2j} = \emptyset$?] If intersection $V_{1i} \cap V_{2j} = \emptyset$ then go to I3 to check the
next intersection. Otherwise, set $r \leftarrow r + 1$ and establish a new class $V_r = V_{1i} \cap V_{2j}$
for $U/\theta_1\theta_2$. Go to I3 to check the next intersection.

I3. [Is $j = t$?] If $j = t$ then go to I5 to check next $i$. Otherwise, go to next
step I4 to see next $j$.

I4. [Increase $j$] Set $j \leftarrow j + 1$. Go to I2.

I5. [Is $i = s$?] If $i = s$ then the classification is completed, and we have
$U/\theta_1\theta_2 = \{V_1, V_2, \ldots, V_r\}$. Otherwise, go to I6.

I6. [Increase $i$] Set $i \leftarrow i + 1$, $j \leftarrow 1$. Go to I2. ◊

For our example, algorithm I will find

$U/\theta_1\theta_2 = \{\{u_1, u_2, u_3\}, \{u_4, u_6\}, \{u_5\}\}$;

$U/\theta_2\theta_3 = \{\{u_1, u_4\}, \{u_2\}, \{u_3, u_6\}, \{u_5\}\}$;

$U/\theta_1\theta_3 = U/\theta_1\theta_2\theta_3 = \{\{u_1\}, \{u_2\}, \{u_3\}, \{u_4\}, \{u_5\}, \{u_6\}\}$.
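
The double loop of Algorithm I translates directly into Python. In this sketch the name `intersect` is hypothetical and the partitions of the running example are given as lists of lists; every non-empty pairwise intersection becomes a class of the combined classification, in the order in which the pairs are visited.

```python
def intersect(p1, p2):
    """Classification induced jointly by two partitions of the same universe."""
    result = []
    for v1 in p1:                  # i ranges over the classes of the first partition
        for v2 in p2:              # j ranges over the classes of the second one
            common = [u for u in v1 if u in v2]
            if common:             # empty intersections are simply ignored
                result.append(common)
    return result


U_T1 = [["u1", "u2", "u3"], ["u4", "u5", "u6"]]
U_T2 = [["u1", "u2", "u3", "u4", "u6"], ["u5"]]
U_T3 = [["u1", "u4"], ["u2", "u5"], ["u3", "u6"]]

t1_t2 = intersect(U_T1, U_T2)
t2_t3 = intersect(U_T2, U_T3)
t1_t3 = intersect(U_T1, U_T3)
```

Applying `intersect` repeatedly gives the classification for any finite set of equivalence relations.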

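Let $W \subseteq U$ be a subset of universe $U$. For a subset $\Omega \subseteq \Theta$, the support subset
of $W$ from $\Omega$ is subset $S_\Omega(W) = \bigcup_{V \in U/\Omega,\, V \subseteq W} V$ of $U$ and $spt_\Omega(W) = |S_\Omega(W)|/|U|$
is called the support degree to $W$ from $\Omega$.

We know the following.

(1) Let $\Omega_1, \Omega_2 \subseteq \Theta$ be two subsets of $\Theta$. If $\Omega_1 \subseteq \Omega_2$ then $S_{\Omega_1}(W) \subseteq
S_{\Omega_2}(W)$, $spt_{\Omega_1}(W) \leq spt_{\Omega_2}(W)$.

(2) Let $\Omega_1, \Omega_2 \subseteq \Theta$ be two subsets of $\Theta$. Then

$S_{\Omega_1 \cup \Omega_2}(W) \supseteq S_{\Omega_1}(W), S_{\Omega_2}(W)$; $spt_{\Omega_1 \cup \Omega_2}(W) \geq spt_{\Omega_1}(W), spt_{\Omega_2}(W)$.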



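For our example, we have

$spt_{\theta_1\theta_2}(W_1) = |S_{\theta_1\theta_2}(W_1)|/|U| = |\{u_5\}|/|U| = 1/6$;

$spt_{\theta_1\theta_2}(W_2) = |S_{\theta_1\theta_2}(W_2)|/|U| = |\emptyset|/|U| = 0$;

$spt_{\theta_3\theta_2}(W_1) = |S_{\theta_3\theta_2}(W_1)|/|U| = |W_1|/|U| = 3/6$;

$spt_{\theta_3\theta_2}(W_2) = |S_{\theta_3\theta_2}(W_2)|/|U| = |W_2|/|U| = 3/6$;

$spt_{\theta_1\theta_3}(W_1) = |S_{\theta_1\theta_3}(W_1)|/|U| = |W_1|/|U| = 3/6$;

$spt_{\theta_1\theta_3}(W_2) = |S_{\theta_1\theta_3}(W_2)|/|U| = |W_2|/|U| = 3/6$. ◊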

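Let $\Delta \subseteq \Theta$ be a subset of $\Theta$ called the decision partition subset. For a
subset $\Omega \subseteq \Theta$ called the condition partition subset, the support subset of the
decision partition subset $\Delta$ from condition partition subset $\Omega$ is subset $S_\Omega(\Delta)$,
and $spt_\Omega(\Delta) = |S_\Omega(\Delta)|/|U|$ is called the support degree to $\Delta$ from $\Omega$.

We know the following. (1) If $U/\Delta = U/\delta$, where $\delta$ is the “universal”
partition $U/\delta = \{U\}$, then we have $S_\Omega(\Delta) = U$, $spt_\Omega(\Delta) = 1$ for all $\Omega \subseteq \Theta$.
(2) Let $\Omega_1, \Omega_2 \subseteq \Theta$ be two condition partition subsets. If $\Omega_1 \subseteq \Omega_2$ then
$S_{\Omega_1}(\Delta) \subseteq S_{\Omega_2}(\Delta)$, $spt_{\Omega_1}(\Delta) \leq spt_{\Omega_2}(\Delta)$. (3) Let $\Omega_1, \Omega_2 \subseteq \Theta$ be two
condition partition subsets. Then $S_{\Omega_1 \cup \Omega_2}(\Delta) \supseteq S_{\Omega_1}(\Delta), S_{\Omega_2}(\Delta)$; $spt_{\Omega_1 \cup \Omega_2}(\Delta) \geq
spt_{\Omega_1}(\Delta), spt_{\Omega_2}(\Delta)$.

For our example, we have $spt_{\theta_1\theta_2}(\theta) = 1/6$, $spt_{\theta_2\theta_3}(\theta) = 1$, $spt_{\theta_1\theta_3}(\theta) = 1$,
$spt_{\theta_1\theta_2\theta_3}(\theta) = 1$. ◊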



4 Functional Dependencies 

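A functional dependency (for a subset $\Delta \subseteq \Theta$) between two subsets $\Omega_1, \Omega_2 \subseteq \Theta$,
is a statement, denoted by $\Omega_1 \rightarrow \Omega_2\ (\Delta)$, which holds, if and only if, we have
that $S_{\Omega_1}(\Delta) \supseteq S_{\Omega_2}(\Delta)$; i.e., for every $W \in U/\Delta$ we have $S_{\Omega_1}(W) \supseteq S_{\Omega_2}(W)$.
We know that if $\Omega_1 \supseteq \Omega_2$ then $\Omega_1 \rightarrow \Omega_2\ (\Delta)$.

For our example, we have $\{\theta_1, \theta_2, \theta_3\}, \{\theta_2, \theta_3\}, \{\theta_1, \theta_3\}
\rightarrow \{\theta_1, \theta_2\}, \{\theta_2\} \rightarrow \{\theta_1\}\ (\theta)$.

An identity dependency (for a subset $\Delta \subseteq \Theta$) between two subsets $\Omega_1, \Omega_2 \subseteq \Theta$,
is a statement, denoted by $\Omega_1 \leftrightarrow \Omega_2\ (\Delta)$, which holds, if and only if, we have
that $S_{\Omega_1}(\Delta) = S_{\Omega_2}(\Delta)$; i.e., for every $W \in U/\Delta$ we have $S_{\Omega_1}(W) = S_{\Omega_2}(W)$.
For our example, we have $\{\theta_1, \theta_2, \theta_3\} \leftrightarrow \{\theta_2, \theta_3\} \leftrightarrow \{\theta_1, \theta_3\}$;
$\{\theta_1, \theta_2\} \leftrightarrow \{\theta_2\}\ (\theta)$.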
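
Both kinds of dependency reduce to comparing support subsets, which the following Python sketch makes explicit on the flu table. All names (`TABLE`, `partition`, `support`, `functionally_depends`, `identity_depends`) and the attribute keys are illustrative assumptions. The support subset of the decision from a set of condition attributes is computed as the union of the condition classes that fall inside a single decision class.

```python
TABLE = {
    "u1": {"t1": "y", "t2": "y", "t3": 0, "d": "n"},
    "u2": {"t1": "y", "t2": "y", "t3": 1, "d": "y"},
    "u3": {"t1": "y", "t2": "y", "t3": 2, "d": "y"},
    "u4": {"t1": "n", "t2": "y", "t3": 0, "d": "n"},
    "u5": {"t1": "n", "t2": "n", "t3": 1, "d": "n"},
    "u6": {"t1": "n", "t2": "y", "t3": 2, "d": "y"},
}


def partition(table, attrs):
    """Group objects that agree on every attribute in `attrs`."""
    classes = {}
    for obj, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return list(classes.values())


def support(table, cond, dec):
    """Union of the cond-classes that lie inside one dec-class."""
    dec_classes = partition(table, dec)
    result = set()
    for v in partition(table, cond):
        if any(v <= w for w in dec_classes):
            result |= v
    return result


def functionally_depends(table, omega1, omega2, dec):
    """omega1 -> omega2 (dec): the support from omega1 contains that from omega2."""
    return support(table, omega1, dec) >= support(table, omega2, dec)


def identity_depends(table, omega1, omega2, dec):
    """omega1 <-> omega2 (dec): both supports coincide."""
    return support(table, omega1, dec) == support(table, omega2, dec)
```

For instance, `identity_depends(TABLE, ["t2", "t3"], ["t1", "t3"], ["d"])` holds because both attribute sets support the whole universe.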



5 Significance and Core 

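Let $\Omega$ be a nonempty subset of $\Theta$: $\emptyset \subset \Omega \subseteq \Theta$. Let $\Delta$ be a subset of $\Theta$: $\Delta \subseteq \Theta$
such that $\Delta \neq \emptyset$, $U/\Delta \neq U/\delta = \{U\}$.

Given an $\omega \in \Omega$, we say that $\omega$ is significant (for $\Delta$) in $\Omega$ if $S_\Omega(\Delta) \supset
S_{\Omega - \{\omega\}}(\Delta)$; and that $\omega$ is not significant or nonsignificant (for $\Delta$) in $\Omega$ if
$S_\Omega(\Delta) = S_{\Omega - \{\omega\}}(\Delta)$. That is, $\omega \in \Omega$ is significant (for $\Delta$) in $\Omega$ if and only
if $\Omega \not\leftrightarrow \Omega - \{\omega\}\ (\Delta)$; $\omega \in \Omega$ is not significant (for $\Delta$) in $\Omega$ if and only if
$\Omega \leftrightarrow \Omega - \{\omega\}\ (\Delta)$.

Given an $\omega \in \Omega$, we define the significance of $\omega$ (for $\Delta$) in $\Omega$ as
$sig^{\Delta}_{\Omega - \{\omega\}}(\omega) = (|S_\Omega(\Delta)| - |S_{\Omega - \{\omega\}}(\Delta)|)/|U|$. In the special case where $\Omega$ is a
singleton, $\Omega = \{\omega\}$, we also denote $sig^{\Delta}_{\emptyset}(\omega)$ by $sig^{\Delta}(\omega)$:
$sig^{\Delta}(\omega) = sig^{\Delta}_{\emptyset}(\omega) = |S_\omega(\Delta)|/|U|$. So
we always have $sig^{\Delta}(\omega) > 0$ unless $S_\omega(\Delta) = \emptyset$.

We know the following: (1) $0 \leq sig^{\Delta}_{\Omega - \{\omega\}}(\omega) \leq 1$. (2) $\omega$ is significant (for $\Delta$)
in $\Omega$ if and only if $sig^{\Delta}_{\Omega - \{\omega\}}(\omega) > 0$.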
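For our example, we have $sig^{\theta}(\theta_1) = 0$, $sig^{\theta}(\theta_2) = 1/6$, $sig^{\theta}(\theta_3) = 4/6$,
$sig^{\theta}_{\theta_1}(\theta_2) = 1/6$, $sig^{\theta}_{\theta_2}(\theta_1) = 0$, $sig^{\theta}_{\theta_1}(\theta_3) = 1$,
$sig^{\theta}_{\theta_3}(\theta_1) = sig^{\theta}_{\theta_3}(\theta_2) = 2/6$, $sig^{\theta}_{\theta_2}(\theta_3) = 5/6$,
$sig^{\theta}_{\theta_1\theta_2}(\theta_3) = 5/6$, $sig^{\theta}_{\theta_3\theta_2}(\theta_1) = 0$, $sig^{\theta}_{\theta_1\theta_3}(\theta_2) = 0$. ◊

The set of all $\omega \in \Omega$ which are significant (for $\Delta$) in $\Omega$ is called the core of
$\Omega$ (for $\Delta$), denoted by $C_\Omega$. That is, $C_\Omega = \{\omega \in \Omega \mid sig^{\Delta}_{\Omega - \{\omega\}}(\omega) > 0\}$. Also, we
define $C_\emptyset = \emptyset$.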

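Algorithm C. This algorithm computes the core $C_\Omega$ of $\Omega$ (for $\Delta$). Let
$\Omega = \{\omega_1, \omega_2, \ldots, \omega_{|\Omega|}\}$. We use $C_\Omega \ni \omega$ to represent that $\omega$ is included in $C_\Omega$
(i.e., $\omega$ is significant (for $\Delta$) in $\Omega$).

C1. [Initialize.] Set $i \leftarrow 1$.

C2. [Is $sig^{\Delta}_{\Omega - \{\omega_i\}}(\omega_i) > 0$?] If $sig^{\Delta}_{\Omega - \{\omega_i\}}(\omega_i) > 0$ go to C3. If
$sig^{\Delta}_{\Omega - \{\omega_i\}}(\omega_i) = 0$ go to C4.

C3. [Set $C_\Omega \ni \omega_i$.] Set $C_\Omega \ni \omega_i$, go to C4.

C4. [Is $i = |\Omega|$?] If $i = |\Omega|$ then the algorithm is finished and $C_\Omega$ is the core
of $\Omega$ (for $\Delta$). If $i < |\Omega|$ then go to C5.

C5. [Increase $i$.] Set $i \leftarrow i + 1$. Go to C2. ◊

For our example, algorithm C will find $C_{\{\theta_1, \theta_2, \theta_3\}} = \{\theta_3\}$.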
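
The significance measure and the core can be computed with a few helper functions, sketched here in Python on the flu table. The sketch assumes that the significance of an attribute relative to a set of other attributes is the gain in support size divided by the size of the universe, which is the reading that matches the values of the running example. The names `TABLE`, `partition`, `support`, `significance` and `core` are hypothetical.

```python
from fractions import Fraction

TABLE = {
    "u1": {"t1": "y", "t2": "y", "t3": 0, "d": "n"},
    "u2": {"t1": "y", "t2": "y", "t3": 1, "d": "y"},
    "u3": {"t1": "y", "t2": "y", "t3": 2, "d": "y"},
    "u4": {"t1": "n", "t2": "y", "t3": 0, "d": "n"},
    "u5": {"t1": "n", "t2": "n", "t3": 1, "d": "n"},
    "u6": {"t1": "n", "t2": "y", "t3": 2, "d": "y"},
}


def partition(table, attrs):
    classes = {}
    for obj, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return list(classes.values())


def support(table, cond, dec):
    """Union of the cond-classes that lie inside one dec-class."""
    dec_classes = partition(table, dec)
    return {u for v in partition(table, cond)
            if any(v <= w for w in dec_classes) for u in v}


def significance(table, omega, rest, dec):
    """Gain in support obtained by adding `omega` to the attributes in `rest`."""
    gain = len(support(table, rest + [omega], dec)) - len(support(table, rest, dec))
    return Fraction(gain, len(table))


def core(table, cond, dec):
    """Attributes that are significant in the full condition set."""
    return [w for w in cond
            if significance(table, w, [a for a in cond if a != w], dec) > 0]


flu_core = core(TABLE, ["t1", "t2", "t3"], ["d"])
```

With an empty `rest`, the whole universe forms a single class, so the baseline support is empty whenever the decision has more than one class.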



6 Reducts 

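A nonempty subset $\Omega$ is said to be independent (for $\Delta$) if each $\omega \in \Omega$ is significant
(for $\Delta$) in $\Omega$; otherwise $\Omega$ is dependent (for $\Delta$). An empty set $\emptyset$ is said to be
independent (for $\Delta$).

For our example, we know the following.

(1) $\{\theta_1, \theta_2, \theta_3\}$, $\{\theta_1, \theta_2\}$, $\{\theta_1\}$ are dependent (for $\theta$).

(2) $\{\theta_2, \theta_3\}$, $\{\theta_1, \theta_3\}$, $\{\theta_2\}$, $\{\theta_3\}$ are independent (for $\theta$). ◊

A subset $\Omega_0$ of $\Omega$ is said to be a reduct of $\Omega$ (for $\Delta$) if $\Omega_0$ satisfies (1)
$S_{\Omega_0}(\Delta) = S_\Omega(\Delta)$; i.e., $\Omega_0 \leftrightarrow \Omega\ (\Delta)$; (2) if $\Omega' \subset \Omega_0$ then $S_\Omega(\Delta) \supset S_{\Omega'}(\Delta)$; i.e.,
if $\Omega' \subset \Omega_0$ then $\Omega' \not\leftrightarrow \Omega\ (\Delta)$. The empty subset $\emptyset$ has reduct $\emptyset$ (for $\Delta$). That
is, $\Omega_0$ is a minimal identity dependent subset of $\Omega$.

Theorem. Every (nonempty or empty) subset $\Omega$ of $\Theta$ has a reduct $\Omega_0 \subseteq \Omega$
(for $\Delta$) such that (1) $S_{\Omega_0}(\Delta) = S_\Omega(\Delta)$; i.e., $\Omega_0 \leftrightarrow \Omega\ (\Delta)$; (2) if $\Omega' \subset \Omega_0$ then
$S_\Omega(\Delta) \supset S_{\Omega'}(\Delta)$; i.e., if $\Omega' \subset \Omega_0$ then $\Omega' \not\leftrightarrow \Omega\ (\Delta)$; (3) $\Omega_0$ is independent (for
$\Delta$).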

This theorem gives an algorithm to find one reduct (for A). That is the 
following. 

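Algorithm A. Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_{|\Omega|}\}$. This algorithm finds one reduct
of $\Omega$ (for $\Delta$).

Step 1. Compute $sig^{\Delta}_{\Omega - \{\omega_j\}}(\omega_j)$ for $j = 1, 2, \ldots, |\Omega|$. Choose $\omega_{j_1}$ such that
$sig^{\Delta}_{\Omega - \{\omega_{j_1}\}}(\omega_{j_1}) = \min_{j = 1, 2, \ldots, |\Omega|}(sig^{\Delta}_{\Omega - \{\omega_j\}}(\omega_j))$.

If $sig^{\Delta}_{\Omega - \{\omega_{j_1}\}}(\omega_{j_1}) > 0$ then $\Omega$ is one reduct of $\Omega$ (for $\Delta$) and the algorithm
is completed. Otherwise, go to step 2.

Step 2. Compute $sig^{\Delta}_{\Omega - \{\omega_{j_1}, \omega_j\}}(\omega_j)$ for $j = 1, 2, \ldots, |\Omega|$; $j \neq j_1$.

Choose $\omega_{j_2}$ such that
$sig^{\Delta}_{\Omega - \{\omega_{j_1}, \omega_{j_2}\}}(\omega_{j_2}) = \min_{j = 1, 2, \ldots, |\Omega|;\, j \neq j_1}(sig^{\Delta}_{\Omega - \{\omega_{j_1}, \omega_j\}}(\omega_j))$.

If $sig^{\Delta}_{\Omega - \{\omega_{j_1}, \omega_{j_2}\}}(\omega_{j_2}) > 0$ then $\Omega - \{\omega_{j_1}\}$ is one reduct of $\Omega$ (for $\Delta$) and
the algorithm is completed. Otherwise, go to step 3.

Step 3. Compute $sig^{\Delta}_{\Omega - \{\omega_{j_1}, \omega_{j_2}, \omega_j\}}(\omega_j)$ for $j = 1, 2, \ldots, |\Omega|$; $j \neq j_1, j_2$.

Choose $\omega_{j_3}$ such that
$sig^{\Delta}_{\Omega - \{\omega_{j_1}, \omega_{j_2}, \omega_{j_3}\}}(\omega_{j_3}) = \min_{j = 1, 2, \ldots, |\Omega|;\, j \neq j_1, j_2}(sig^{\Delta}_{\Omega - \{\omega_{j_1}, \omega_{j_2}, \omega_j\}}(\omega_j))$.

If $sig^{\Delta}_{\Omega - \{\omega_{j_1}, \omega_{j_2}, \omega_{j_3}\}}(\omega_{j_3}) > 0$ then $\Omega - \{\omega_{j_1}, \omega_{j_2}\}$ is one reduct of $\Omega$ (for
$\Delta$) and the algorithm is completed. Otherwise, go to step 4.

And so on.

Step $|\Omega|$. Compute $sig^{\Delta}(\omega_j)$ for $j = 1, 2, \ldots, |\Omega|$; $j \neq j_1, j_2, \ldots, j_{|\Omega|-1}$. If
$sig^{\Delta}(\omega_j) > 0$ then $\{\omega_j\}$ is one reduct of $\Omega$ (for $\Delta$) and the algorithm is
completed. If $sig^{\Delta}(\omega_j) = 0$ then $\Omega = \{\omega\}$, $U/\omega = \{U\}$ and the empty set $\emptyset$ is
one reduct of $\Omega$ (for $\Delta$) and the algorithm is completed. ◊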

The algorithm can be run in parallel mode to find concurrently all reducts if 
we choose all opportunities at each step. The following is an example. 

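For our example, algorithm A will be run as follows.

Step 1. $sig^{\theta}_{\theta_1\theta_2}(\theta_3) = 5/6$, $sig^{\theta}_{\theta_2\theta_3}(\theta_1) = sig^{\theta}_{\theta_1\theta_3}(\theta_2) = 0$.

Step 2.A. Choose $j_1 = 1$: $sig^{\theta}_{\theta_2}(\theta_3) = 5/6$, $sig^{\theta}_{\theta_3}(\theta_2) = 2/6$.
We find that $\{\theta_2, \theta_3\}$ is one reduct of $\{\theta_1, \theta_2, \theta_3\}$ (for $\theta$) and the
algorithm is completed.

Step 2.B. Choose $j_1 = 2$: $sig^{\theta}_{\theta_1}(\theta_3) = 1$, $sig^{\theta}_{\theta_3}(\theta_1) = 2/6$. We find that
$\{\theta_1, \theta_3\}$ is one reduct of $\{\theta_1, \theta_2, \theta_3\}$ (for $\theta$) and the algorithm is completed.
Thus, we find that all reducts of $\{\theta_1, \theta_2, \theta_3\}$ (for $\theta$) are: $\{\theta_1, \theta_3\}$ and $\{\theta_2, \theta_3\}$.

◊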
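
The elimination strategy behind Algorithm A can be sketched as a loop that repeatedly removes a least significant attribute until every remaining attribute is significant. The code below is a hedged illustration on the flu table, not the authors' program: the helper names are hypothetical, the significance is assumed to be the gain in support size over the size of the universe, and ties are broken by taking the first minimal attribute, so only one of the possible reducts is returned.

```python
from fractions import Fraction

TABLE = {
    "u1": {"t1": "y", "t2": "y", "t3": 0, "d": "n"},
    "u2": {"t1": "y", "t2": "y", "t3": 1, "d": "y"},
    "u3": {"t1": "y", "t2": "y", "t3": 2, "d": "y"},
    "u4": {"t1": "n", "t2": "y", "t3": 0, "d": "n"},
    "u5": {"t1": "n", "t2": "n", "t3": 1, "d": "n"},
    "u6": {"t1": "n", "t2": "y", "t3": 2, "d": "y"},
}


def partition(table, attrs):
    classes = {}
    for obj, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return list(classes.values())


def support(table, cond, dec):
    dec_classes = partition(table, dec)
    return {u for v in partition(table, cond)
            if any(v <= w for w in dec_classes) for u in v}


def significance(table, omega, rest, dec):
    gain = len(support(table, rest + [omega], dec)) - len(support(table, rest, dec))
    return Fraction(gain, len(table))


def one_reduct(table, cond, dec):
    """Drop a least significant attribute until all remaining ones are significant."""
    current = list(cond)
    while current:
        sigs = {w: significance(table, w, [a for a in current if a != w], dec)
                for w in current}
        least = min(current, key=lambda w: sigs[w])   # first minimum wins ties
        if sigs[least] > 0:
            return current            # every attribute is significant: a reduct
        current.remove(least)         # a nonsignificant attribute can be removed
    return current                    # the empty set is the reduct


reduct = one_reduct(TABLE, ["t1", "t2", "t3"], ["d"])
```

Exploring every tie instead of the first one would enumerate the alternative reducts, as in Steps 2.A and 2.B of the example.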

7 Finding One Reduct 

Using significances, we can design the following algorithm: 

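Algorithm H. Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_{|\Omega|}\}$ be a (nonempty or empty) subset
of $\Theta$: $\emptyset \subseteq \Omega \subseteq \Theta$. This algorithm finds at most one reduct of $\Omega$ (for $\Delta$).

Step 1. If $|\Omega| = 0$ or $|\Omega| = 1$, $\Omega = \{\omega_1\}$, $S_{\omega_1}(\Delta) = \emptyset$ then $\emptyset$ is the unique
reduct of $\Omega$ and the algorithm is completed. If $|\Omega| \geq 1$ and there exists an $\omega \in
\Omega$ such that $S_\omega(\Delta) \neq \emptyset$ then compute $sig^{\Delta}(\omega_j) = |S_{\omega_j}(\Delta)|/|U|$ for $j = 1, 2, \ldots, |\Omega|$.

Suppose that $sig^{\Delta}(\omega_{j_1}) \geq sig^{\Delta}(\omega_{j_2}) \geq sig^{\Delta}(\omega_{j_3}) \geq \ldots \geq sig^{\Delta}(\omega_{j_{|\Omega|-1}}) \geq
sig^{\Delta}(\omega_{j_{|\Omega|}})$, where $sig^{\Delta}(\omega_{j_1}) > 0$.

Step 2. If $sig^{\Delta}(\omega_{j_2}) = 0$ then $S_{\omega_{j_2}}(\Delta) = S_{\omega_{j_3}}(\Delta) = \ldots = S_{\omega_{j_{|\Omega|}}}(\Delta) = \emptyset$ so
we can take $\{\omega_{j_1}\}$ to be the unique reduct of $\Omega$
($\{\omega_{j_1}\}$ is independent) and the algorithm is completed. If $sig^{\Delta}(\omega_{j_2}) > 0$ then
compute $sig^{\Delta}_{\omega_{j_1}}(\omega_{j_2})$ and check if $sig^{\Delta}_{\omega_{j_1}}(\omega_{j_2}) = 0$. If $sig^{\Delta}_{\omega_{j_1}}(\omega_{j_2}) = 0$ then $\{\omega_{j_1}\}$ may
be one reduct of $\Omega$ and the algorithm is completed. If $sig^{\Delta}_{\omega_{j_1}}(\omega_{j_2}) > 0$ then
go to the third step.

Step 3. If $sig^{\Delta}(\omega_{j_3}) = 0$ then $S_{\omega_{j_3}}(\Delta) = S_{\omega_{j_4}}(\Delta) = \ldots = S_{\omega_{j_{|\Omega|}}}(\Delta) = \emptyset$
so $\{\omega_{j_1}, \omega_{j_2}\}$ may be one reduct of $\Omega$
and the algorithm is completed. If $sig^{\Delta}(\omega_{j_3}) > 0$ then compute
$sig^{\Delta}_{\omega_{j_1}\omega_{j_2}}(\omega_{j_3})$ and check if $sig^{\Delta}_{\omega_{j_1}\omega_{j_2}}(\omega_{j_3}) = 0$.

If $sig^{\Delta}_{\omega_{j_1}\omega_{j_2}}(\omega_{j_3}) = 0$ then $\{\omega_{j_1}, \omega_{j_2}\}$ may be one reduct of $\Omega$ and the
algorithm is completed. If $sig^{\Delta}_{\omega_{j_1}\omega_{j_2}}(\omega_{j_3}) > 0$ then go to the fourth step.

And so on.

Step $|\Omega|$. If $sig^{\Delta}(\omega_{j_{|\Omega|}}) = 0$ then $S_{\omega_{j_{|\Omega|}}}(\Delta) = \emptyset$ so $\Omega - \{\omega_{j_{|\Omega|}}\}$ may be one
reduct of $\Omega$ and the algorithm is completed. If $sig^{\Delta}(\omega_{j_{|\Omega|}}) > 0$ then compute
$sig^{\Delta}_{\Omega - \{\omega_{j_{|\Omega|}}\}}(\omega_{j_{|\Omega|}})$ and check if $sig^{\Delta}_{\Omega - \{\omega_{j_{|\Omega|}}\}}(\omega_{j_{|\Omega|}}) = 0$. If $sig^{\Delta}_{\Omega - \{\omega_{j_{|\Omega|}}\}}(\omega_{j_{|\Omega|}}) = 0$
then $\Omega - \{\omega_{j_{|\Omega|}}\}$ may be one reduct of $\Omega$ and the algorithm is completed.
If $sig^{\Delta}_{\Omega - \{\omega_{j_{|\Omega|}}\}}(\omega_{j_{|\Omega|}}) > 0$ then $\Omega$ may be one reduct of $\Omega$ and the
algorithm is completed. ◊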

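For our example, algorithm H will be run as follows.

Step 1. $sig^{\theta}(\theta_3) = 4/6 > sig^{\theta}(\theta_2) = 1/6 > sig^{\theta}(\theta_1) = 0$. So $\theta_{j_1} = \theta_3$, $\theta_{j_2} =
\theta_2$, $\theta_{j_3} = \theta_1$.

Step 2. $sig^{\theta}(\theta_2) > 0$. $sig^{\theta}_{\theta_3}(\theta_2) = 2/6 > 0$.

Step 3. $sig^{\theta}(\theta_1) = 0$. So $\{\theta_3, \theta_2\}$ may be one reduct of $\{\theta_1, \theta_2, \theta_3\}$ (in fact,
this is true) and the algorithm is completed. ◊

Algorithm H only considers the order of significances

$sig^{\Delta}(\omega_{j_1}) \geq sig^{\Delta}(\omega_{j_2}) \geq \ldots \geq sig^{\Delta}(\omega_{j_{|\Omega|}})$.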

We can also consider significance step by step. This leads to the following 
algorithm. 

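Algorithm H'. Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_{|\Omega|}\}$ be a (nonempty or empty) subset
of $\Theta$: $\emptyset \subseteq \Omega \subseteq \Theta$. This algorithm finds at most one reduct of $\Omega$ (for $\Delta$).

Step 1. Compute $sig^{\Delta}(\omega_j) = |S_{\omega_j}(\Delta)|/|U|$ for $j = 1, 2, \ldots, |\Omega|$. Choose $\omega_{j_1}$ such
that $sig^{\Delta}(\omega_{j_1}) = \max_{j = 1, 2, \ldots, |\Omega|}(sig^{\Delta}(\omega_j))$. If $sig^{\Delta}(\omega_{j_1}) = 0$ then $\Omega = \{\delta\}$ with
$S_\delta(\Delta) = \emptyset$. So $\emptyset$ is the unique reduct of $\Omega$ and the algorithm is completed.
Otherwise, go to step 2.

Step 2. Compute $sig^{\Delta}_{\omega_{j_1}}(\omega_j)$ for $j = 1, 2, \ldots, |\Omega|$; $j \neq j_1$.

Choose $\omega_{j_2}$ such that $sig^{\Delta}_{\omega_{j_1}}(\omega_{j_2}) = \max_{j = 1, 2, \ldots, |\Omega|;\, j \neq j_1}(sig^{\Delta}_{\omega_{j_1}}(\omega_j))$. If
$sig^{\Delta}_{\omega_{j_1}}(\omega_{j_2}) = 0$ then $\{\omega_{j_1}\}$ may be one reduct of $\Omega$ and the algorithm is
completed. Otherwise, go to step 3.

Step 3. Compute $sig^{\Delta}_{\omega_{j_1}\omega_{j_2}}(\omega_j)$ for $j = 1, 2, \ldots, |\Omega|$; $j \neq j_1, j_2$.

Choose $\omega_{j_3}$ such that
$sig^{\Delta}_{\omega_{j_1}\omega_{j_2}}(\omega_{j_3}) = \max_{j = 1, 2, \ldots, |\Omega|;\, j \neq j_1, j_2}(sig^{\Delta}_{\omega_{j_1}\omega_{j_2}}(\omega_j))$.

If $sig^{\Delta}_{\omega_{j_1}\omega_{j_2}}(\omega_{j_3}) = 0$ then $\{\omega_{j_1}, \omega_{j_2}\}$ may be one reduct of $\Omega$ and the
algorithm is completed. Otherwise, go to step 4.

And so on.

Step $|\Omega|$. Compute $sig^{\Delta}_{\omega_{j_1}\omega_{j_2}\ldots\omega_{j_{|\Omega|-1}}}(\omega_j)$ for $j = 1, \ldots, |\Omega|$; $j \neq j_1, \ldots, j_{|\Omega|-1}$.
If $sig^{\Delta}_{\omega_{j_1}\omega_{j_2}\ldots\omega_{j_{|\Omega|-1}}}(\omega_j) = 0$ then $\Omega - \{\omega_j\}$ may be one reduct of $\Omega$ and
the algorithm is completed. If $sig^{\Delta}_{\omega_{j_1}\omega_{j_2}\ldots\omega_{j_{|\Omega|-1}}}(\omega_j) > 0$ then $\Omega$ may be one
reduct of $\Omega$ and the algorithm is completed. ◊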

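For our example, algorithm H' will be run as follows.

Step 1. $sig^{\theta}(\theta_3) = 4/6 > sig^{\theta}(\theta_2) = 1/6 > sig^{\theta}(\theta_1) = 0$. So $\theta_{j_1} = \theta_3$, and
$sig^{\theta}(\theta_3) > 0$.

Step 2. $sig^{\theta}_{\theta_3}(\theta_2) = sig^{\theta}_{\theta_3}(\theta_1) = 2/6 > 0$.

Step 3.A. Choose $\theta_{j_2} = \theta_2$. $sig^{\theta}_{\theta_3\theta_2}(\theta_1) = 0$. So $\{\theta_3, \theta_2\}$ may be one reduct
of $\{\theta_1, \theta_2, \theta_3\}$ (in fact, this is true) and the algorithm is completed.

Step 3.B. Choose $\theta_{j_2} = \theta_1$. $sig^{\theta}_{\theta_3\theta_1}(\theta_2) = 0$. So $\{\theta_3, \theta_1\}$ may be one reduct
of $\{\theta_1, \theta_2, \theta_3\}$ (in fact, this is true) and the algorithm is completed.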
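
The step-by-step selection of Algorithm H' amounts to a greedy forward search: start from the empty set and keep adding the attribute with the largest significance relative to the attributes already chosen, stopping when no attribute adds anything. The Python sketch below illustrates this on the flu table under the same assumptions as before (hypothetical helper names, significance taken as the gain in support size over the size of the universe, first maximum wins ties). As the text says, the result only "may be" a reduct.

```python
from fractions import Fraction

TABLE = {
    "u1": {"t1": "y", "t2": "y", "t3": 0, "d": "n"},
    "u2": {"t1": "y", "t2": "y", "t3": 1, "d": "y"},
    "u3": {"t1": "y", "t2": "y", "t3": 2, "d": "y"},
    "u4": {"t1": "n", "t2": "y", "t3": 0, "d": "n"},
    "u5": {"t1": "n", "t2": "n", "t3": 1, "d": "n"},
    "u6": {"t1": "n", "t2": "y", "t3": 2, "d": "y"},
}


def partition(table, attrs):
    classes = {}
    for obj, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return list(classes.values())


def support(table, cond, dec):
    dec_classes = partition(table, dec)
    return {u for v in partition(table, cond)
            if any(v <= w for w in dec_classes) for u in v}


def significance(table, omega, rest, dec):
    gain = len(support(table, rest + [omega], dec)) - len(support(table, rest, dec))
    return Fraction(gain, len(table))


def greedy_candidate(table, cond, dec):
    """Add the most significant attribute until nothing more is gained."""
    chosen, remaining = [], list(cond)
    while remaining:
        sigs = {w: significance(table, w, chosen, dec) for w in remaining}
        best = max(remaining, key=lambda w: sigs[w])   # first maximum wins ties
        if sigs[best] == 0:
            break                     # no remaining attribute adds support
        chosen.append(best)
        remaining.remove(best)
    return chosen


candidate = greedy_candidate(TABLE, ["t1", "t2", "t3"], ["d"])
```

With this tie-breaking the run follows Step 3.B; listing the attributes in another order can follow Step 3.A instead.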

8 An Algorithm to Discover Knowledge 

Applying algorithm H' to decision tables, we can design an algorithm to discover 
knowledge. 

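Algorithm D. Let $\mathcal{T} = (U, A)$ be a decision table with $A = \Omega \cup \Delta$, where $\Omega$ is
the condition attribute set and $\Delta$ is the decision attribute set. Also, $U/\Delta \neq \{U\}$.

Let $\Omega = \{\theta_1, \theta_2, \ldots, \theta_j, \ldots, \theta_J\}$. This algorithm is to find one reduct of
$\Omega$ (for $\Delta$) and discover knowledge.

Step 1. Compute $U/\Delta = \{W_1, W_2, \ldots, W_I\}$ and $sig^{\Delta}(\theta_j)$ for $j =
1, 2, \ldots, J$. For every $i, j$ such that $S_{\theta_j}(W_i) \neq \emptyset$ discover the following rules
($u \in S_{\theta_j}(W_i)$):

If ($\theta_j$) $\theta_j$ is $\theta_j(u)$ Then ($\tau$) $\tau$ is $\tau(u)$ with strength $spt_{\theta_j}(W_i)$.

If $\mathcal{W}_1 = \{W_i \in U/\Delta \mid S_{\theta_j}(W_i) \subset W_i \text{ for all } j\}$ is empty then the algorithm is
completed. Now for each $W_i \in U/\Delta$ there is a $\theta_j$ such that $S_{\theta_j}(W_i) = W_i$. The
collection of these $\theta_j$'s may be a reduct (for $\Delta$). Otherwise, choose $\theta_{j_1}, \theta_{j_2}$ such
that $sig^{\Delta}(\theta_{j_1})$, $sig^{\Delta}(\theta_{j_2})$ are maxima and go to the second step.

Step 2. Compute $S_{\theta_{j_1}\theta_{j_2}}(W_i)$ for all $W_i \in \mathcal{W}_1$.

For $S_{\theta_{j_1}\theta_{j_2}}(W_i)$, where $W_i \in \mathcal{W}_1$, we can discover the following rules
($u \in S_{\theta_{j_1}\theta_{j_2}}(W_i) - S_{\theta_{j_1}}(W_i) - S_{\theta_{j_2}}(W_i) \subseteq W_i$):

If ($\theta_{j_1}$) $\theta_{j_1}$ is $\theta_{j_1}(u)$, ($\theta_{j_2}$) $\theta_{j_2}$ is $\theta_{j_2}(u)$
Then ($\tau$) $\tau$ is $\tau(u)$ with strength $spt_{\theta_{j_1}\theta_{j_2}}(W_i)$.

If $\mathcal{W}_2 = \{W_i \in \mathcal{W}_1 \mid S_{\theta_{j_1}\theta_{j_2}}(W_i) \subset W_i\}$ is empty then the algorithm is
completed and $\{\theta_{j_1}, \theta_{j_2}\}$ may be a reduct (for $\Delta$). Otherwise, go to the third step.

Step 3. Compute $sig^{\Delta}_{\theta_{j_1}\theta_{j_2}}(\theta_j)$ for $j \neq j_1, j_2$.

Choose a $\theta_{j_3}$ such that $sig^{\Delta}_{\theta_{j_1}\theta_{j_2}}(\theta_{j_3})$ is a maximum.

For $S_{\theta_{j_1}\theta_{j_2}\theta_{j_3}}(W_i)$, where $W_i \in \mathcal{W}_2$, we can discover the following rules
($u \in S_{\theta_{j_1}\theta_{j_2}\theta_{j_3}}(W_i) - S_{\theta_{j_1}\theta_{j_2}}(W_i) \subseteq W_i$):

If ($\theta_{j_1}$) $\theta_{j_1}$ is $\theta_{j_1}(u)$, ($\theta_{j_2}$) $\theta_{j_2}$ is $\theta_{j_2}(u)$, ($\theta_{j_3}$) $\theta_{j_3}$ is $\theta_{j_3}(u)$
Then ($\tau$) $\tau$ is $\tau(u)$ with strength $spt_{\theta_{j_1}\theta_{j_2}\theta_{j_3}}(W_i)$.

If $\mathcal{W}_3 = \{W_i \in \mathcal{W}_2 \mid S_{\theta_{j_1}\theta_{j_2}\theta_{j_3}}(W_i) \subset W_i\}$ is empty then the algorithm is
completed and $\{\theta_{j_1}, \theta_{j_2}, \theta_{j_3}\}$ may be a reduct (for $\Delta$). Otherwise, go to the
fourth step.

Step 4. Compute $sig^{\Delta}_{\theta_{j_1}\theta_{j_2}\theta_{j_3}}(\theta_j)$ for $j \neq j_1, j_2, j_3$. Choose a $\theta_{j_4}$ such that
$sig^{\Delta}_{\theta_{j_1}\theta_{j_2}\theta_{j_3}}(\theta_{j_4})$ is a maximum. For $S_{\theta_{j_1}\theta_{j_2}\theta_{j_3}\theta_{j_4}}(W_i)$, where $W_i \in \mathcal{W}_3$, we can
discover the following rules ($u \in S_{\theta_{j_1}\theta_{j_2}\theta_{j_3}\theta_{j_4}}(W_i) - S_{\theta_{j_1}\theta_{j_2}\theta_{j_3}}(W_i) \subseteq W_i$):

If ($\theta_{j_1}$) $\theta_{j_1}$ is $\theta_{j_1}(u)$, ($\theta_{j_2}$) $\theta_{j_2}$ is $\theta_{j_2}(u)$, ($\theta_{j_3}$) $\theta_{j_3}$ is $\theta_{j_3}(u)$, ($\theta_{j_4}$) $\theta_{j_4}$ is $\theta_{j_4}(u)$
Then ($\tau$) $\tau$ is $\tau(u)$ with strength $spt_{\theta_{j_1}\theta_{j_2}\theta_{j_3}\theta_{j_4}}(W_i)$.

If $\mathcal{W}_4 = \{W_i \in \mathcal{W}_3 \mid S_{\theta_{j_1}\theta_{j_2}\theta_{j_3}\theta_{j_4}}(W_i) \subset W_i\}$ is empty then the algorithm is
completed and $\{\theta_{j_1}, \theta_{j_2}, \theta_{j_3}, \theta_{j_4}\}$ may be a reduct (for $\Delta$). Otherwise, go to the
next step. And so on.

Step $|\Omega|$. Compute $sig^{\Delta}_{\theta_{j_1}\theta_{j_2}\ldots\theta_{j_{|\Omega|-1}}}(\theta_j)$ for $j \neq j_1, j_2, \ldots, j_{|\Omega|-1}$. Let $\theta_{j_{|\Omega|}} =
\theta_j$. For $S_{\theta_{j_1}\theta_{j_2}\ldots\theta_{j_{|\Omega|}}}(W_i)$, where $W_i \in \mathcal{W}_{|\Omega|-1}$, we can discover the following
rules ($u \in S_{\theta_{j_1}\theta_{j_2}\ldots\theta_{j_{|\Omega|}}}(W_i) - S_{\theta_{j_1}\theta_{j_2}\ldots\theta_{j_{|\Omega|-1}}}(W_i) \subseteq W_i$):

If ($\theta_{j_1}$) $\theta_{j_1}$ is $\theta_{j_1}(u)$, ($\theta_{j_2}$) $\theta_{j_2}$ is $\theta_{j_2}(u)$, ..., ($\theta_{j_{|\Omega|-1}}$) $\theta_{j_{|\Omega|-1}}$ is $\theta_{j_{|\Omega|-1}}(u)$, ($\theta_{j_{|\Omega|}}$) $\theta_{j_{|\Omega|}}$ is $\theta_{j_{|\Omega|}}(u)$
Then ($\tau$) $\tau$ is $\tau(u)$ with strength $spt_{\theta_{j_1}\theta_{j_2}\ldots\theta_{j_{|\Omega|-1}}\theta_{j_{|\Omega|}}}(W_i)$.

Then the algorithm is completed and $\Omega$ may be a reduct of $\Omega$ (for $\Delta$). ◊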

For our example, algorithm D will be run as follows. 

Step 1. (1) $sig_{\emptyset}(\theta_1) = 0$.

(2) $sig_{\emptyset}(\theta_2) = 1/6$, and $S_{\theta_2}(W_1) = \{u_5\}$, $S_{\theta_2}(W_2) = \emptyset$. So we can discover
the following rule: (for $u_5 \in \{u_5\} \subset W_1$) If ($\theta_2$) no muscle pain Then ($\theta$) no flu
with strength 1/6.

(3) $sig_{\emptyset}(\theta_3) = 4/6$, and $S_{\theta_3}(W_1) = \{u_1, u_4\}$, $S_{\theta_3}(W_2) = \{u_3, u_6\}$. So we can
discover the following rules: (for $u_1, u_4 \in \{u_1, u_4\} \subset W_1$) If ($\theta_3$) temperature is
normal Then ($\theta$) no flu with strength 2/6.

(for $u_3, u_6 \in \{u_3, u_6\} \subset W_2$) If ($\theta_3$) temperature is very high Then ($\theta$) yes, it is
flu with strength 2/6.

We have $sig_{\emptyset}(\theta_3) = 4/6 > sig_{\emptyset}(\theta_2) = 1/6 > sig_{\emptyset}(\theta_1) = 0$.
$\mathcal{W}_1 = \{W_i = W_1, W_2 \mid S_{\theta_j}(W_i) \subset W_i \text{ for all } j = 1, 2, 3\} = \{W_1, W_2\} \neq \emptyset$.
Choose $\theta_{j_1} = \theta_3$, $\theta_{j_2} = \theta_2$.

Step 2. $S_{\theta_3\theta_2}(W_1) = W_1 = \{u_1, u_4, u_5\}$, $S_{\theta_3\theta_2}(W_2) = W_2 = \{u_2, u_3, u_6\}$.
Since $S_{\theta_3\theta_2}(W_1) - S_{\theta_3}(W_1) - S_{\theta_2}(W_1) = \{u_1, u_4, u_5\} - \{u_1, u_4\} - \{u_5\} = \emptyset$, there
are no more rules to be discovered for $W_1$. For $S_{\theta_3\theta_2}(W_2) - S_{\theta_3}(W_2) - S_{\theta_2}(W_2) =
\{u_2, u_3, u_6\} - \{u_3, u_6\} - \emptyset = \{u_2\}$, we can discover the following rule: (for
$u_2 \in \{u_2\} \subset W_2$) If ($\theta_2$) yes, muscle pain ($\theta_3$) temperature is high Then ($\theta$)
yes, it is flu with strength 3/6.

Now, $\mathcal{W}_2 = \{W_i = W_1, W_2 \mid S_{\theta_3\theta_2}(W_i) \subset W_i\} = \emptyset$, so the algorithm is completed
and $\{\theta_3, \theta_2\}$ may be a reduct (for $\theta$) (in fact, this is true).

Summarizing, the decision table is reduced to the following. 

U\A    | θ1 | θ2 | θ3 | θ | Stg
u1, u4 |    |    | 0  | n | 2/6
u2     |    | y  | 1  | y | 3/6
u3, u6 |    |    | 2  | y | 2/6
u5     |    | n  |    | n | 1/6

(Stg = strength)
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9 Summary 

Rough set theory is a new mathematical tool to deal with vagueness and
uncertainty. This approach seems to be important to artificial intelligence and
cognitive sciences. In this paper we suggest a series of algorithms for rough
classification and discovery. In particular, we suggest the use of a significance
measure to design algorithms with lower computational cost.
Abstract. As shown by Russell et al., 1995 [7], Bayesian networks can be
equipped with a gradient descent learning method similar to the training 
method for neural networks. The calculation of the required gradients can 
be performed locally along with propagation. We review how this can be 
done, and we show how the gradient descent approach can be used for 
various tasks like tuning and training with training sets of definite as 
well as non-definite classifications. We introduce tools for resistance and 
damping to guide the direction of convergence, and we use them for a new 
adaptation method which can also handle situations where parameters 
in the network covary. 



1 Introduction 

Generally speaking, neural networks have the advantage that they can be trained 
through a training set of observations together with the correct classification. 
For example, a set of symptoms and test results used for diagnosing together 
with the correct diagnosis can form a training set for a neural network, where 
the evidence is entered at the input nodes and the correct diagnosis is stamped 
upon the output nodes. For each case (or for the entire set of cases), a back
propagation method is initiated to alter the parameters, which are weights and
thresholds. The ingenious property of the back propagation method is that it
calculates the gradient of the error as a function of the parameters through
local calculations, and the parameter vector is then changed in the direction
opposite to the gradient. This method is called gradient descent.

There is a set of common tasks where Bayesian and neural networks compete
(generally speaking, classification problems). One of the virtues of Bayesian
networks is that structure as well as parameters have a clear semantics, so that
the result of training is transparent and open for debate.

Bayesian networks can be trained in a way similar to neural network type 
training. Work was initiated by Laskey, 1990 [3] and Laskey, 1993 [4], and Russell 
et al. [7] take up a specific type of training and show that the required gradients 
can be calculated locally along with propagation. 

In this paper, we first combine the work by Russell et al. with a result from 
Castillo, Gutierrez and Hadi, 1996 [1] to show how simple and efficient it is to 
calculate gradients required for training (Section 4). The technique exploits the
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propagation used, so for this section we assume the reader to be familiar with 
junction tree propagation for Bayesian networks. 

In Section 6, we explain how the technique can be used to simulate neural 
network type training, and it is shown how to handle dependent parameters. In 
Section 7, we introduce resistance and damping for guidance of the direction 
of convergence, and in Section 8 we give a new adaptation method based on 
gradient descent. The adaptation can be performed along with the propagation 
of the case. 

2 Gradient Descent Tuning 

We have a Bayesian network BN. For this network we have some evidence e, and
for a particular variable A we have $x = P(A \mid e) = (x_1, \ldots, x_n)$. We may have
a prior request $y = (y_1, \ldots, y_n)$ for $P(A \mid e)$. So, we want to tune the network
such that $P(A \mid e) = y$. Assume that the structure of BN is fixed, but for the
conditional probabilities we have some freedom described by a set of parameters
$t = (t_1, \ldots, t_m)$ with an initial set of values $t_0$. So, we want to set the parameters
so that $P(A \mid e)$ is sufficiently close to y.

We introduce a distance measure dist(x, y). Two distance measures frequently
used are the Euclidean distance $dist_E$ and the Kullback-Leibler distance
$dist_K$ (both derived from strictly proper scoring rules).

$$dist_E(x, y) = \sum_i (x_i - y_i)^2, \qquad (1)$$

$$dist_K(x, y) = \sum_i y_i (\log y_i - \log x_i). \qquad (2)$$

The task is to set the parameters such that dist(x, y) is as small as possible.
If the parameters cannot be set so that the distance is close to zero, it is an 
indication of a wrong structure. 
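
As a minimal sketch of Formulas 1 and 2, the two distance measures can be written as plain Python functions. The function names and the sample distributions are illustrative assumptions, and the convention that terms with $y_i = 0$ contribute nothing to the Kullback-Leibler sum is also an assumption that the text does not state.

```python
import math

def dist_e(x, y):
    """Euclidean distance of Formula 1: sum of squared differences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def dist_k(x, y):
    """Kullback-Leibler distance of Formula 2.

    Assumption: terms with y_i = 0 are skipped (treated as contributing 0).
    """
    return sum(yi * (math.log(yi) - math.log(xi))
               for xi, yi in zip(x, y) if yi > 0)

x = (0.48, 0.52)  # current P(A | e), illustrative values
y = (0.43, 0.57)  # requested P(A | e), illustrative values
print(dist_e(x, y))
print(dist_k(x, y))
```

Both functions return 0 when x and y coincide and a positive value otherwise, which is the property the tuning procedure relies on.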

If it is possible to determine dist{x,y) as a function of t, you might be so 
fortunate that the problem can be solved directly. However, usually the prob- 
lem cannot be solved directly even when the function is known, and a gradient 
descent method can be used: 

1. Calculate grad dist(x, y) with respect to the parameters t

2. Give $t_0$ a displacement $\Delta t$ in the direction opposite to the direction of
grad dist(x, y)($t_0$)

3. Iterate this procedure until the gradient is close to 0
From the definitions of the two distance measures, we see

$$\frac{\partial}{\partial t} dist_E(x, y) = 2 \sum_i (x_i - y_i) \frac{\partial x_i}{\partial t} \qquad (3)$$




192 



Finn V. Jensen 



and

$$\frac{\partial}{\partial t} dist_K(x, y) = -\sum_i \frac{y_i}{x_i} \frac{\partial x_i}{\partial t} \qquad (4)$$



The $y_i$ are known, and the $x_i$ are available through propagation in BN. So,
what we need is grad $x_i(t)$ for all i.

If the variable A is binary, we have $x = (x, 1 - x)$, $y = (y, 1 - y)$, and

$$dist_E(x, y) = 2(x - y)^2, \qquad (5)$$

$$dist_K(x, y) = y(\log y - \log x) + (1 - y)(\log(1 - y) - \log(1 - x)), \qquad (6)$$

and

$$\mathrm{grad}\, dist_E(x, y) = 4(x - y)\, \mathrm{grad}\, x, \qquad (7)$$

$$\mathrm{grad}\, dist_K(x, y) = \frac{x - y}{x(1 - x)}\, \mathrm{grad}\, x. \qquad (8)$$

From these formulas, we see that the gradient is 0 if, and only if, either x is 
independent of all the parameters or x = y. 



3 Example 

Let BN be the Bayesian network in Figure 1 with initial probabilities from 
Table 1. Let C be the observation variable and A the variable of interest. Assume 
also that the parameters are $t = P(\neg a)$, $s = P(\neg b \mid \neg a)$, $u = P(\neg c \mid \neg b)$. Initially,
we have $t_0 = (0.6, 0.7, 0.4)$.




Fig. 1. A small Bayesian network for illustration 



Table 1. Parameters for the network in Figure 1. P(A) = (0.4, 0.6)

B\A | a | ¬a
b   | 1 | 0.3
¬b  | 0 | 0.7

C\B | b | ¬b
c   | 1 | 0.6
¬c  | 0 | 0.4



Assume that we require $P(A \mid c) = (0.43, 0.57) = (y, 1 - y)$. Through propagation
we get $x = P(a \mid c) = 0.48$. To get grad $x(t)$, we calculate $P(a \mid c)$ as a
function of t:

$$P(a \mid c) = x(t, s, u) = \frac{1 - t}{1 - tsu}, \qquad (9)$$
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and we have 

$$\mathrm{grad}\, x(t) = \frac{1}{(1 - tsu)^2}\,\bigl(su - 1,\ (1 - t)tu,\ (1 - t)ts\bigr) \qquad (10)$$

and

$$\mathrm{grad}\, x(t_0) = (-1.04, 0.139, 0.242). \qquad (11)$$

Formula 7 yields

$$\mathrm{grad}\, dist_E(x, y) = 4(0.48 - 0.43)(-1.04, 0.14, 0.24) = (-0.21, 0.03, 0.05). \qquad (12)$$

Using a step size of 0.1, we get $\Delta t = (0.02, -0.003, -0.005)$ and $t_1 =
(0.62, 0.697, 0.395)$. These new tables give $x(t_1) = 0.458$. Iterating the process,
we stop with $t_f = (0.644, 0.691, 0.387)$ and $x_f = 0.4301$.
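
The tuning loop of this example can be sketched in a few lines of Python. The sketch uses the closed form of Formula 9 in place of propagation, and the function names, the tolerance and the stopping test on |x - y| are illustrative assumptions rather than part of the method as published. Because intermediate values are not rounded here, the iterates agree with the figures above only to about two or three decimals.

```python
def x_of(t, s, u):
    """P(a | c) as a function of the parameters (Formula 9)."""
    return (1 - t) / (1 - t * s * u)

def grad_x(t, s, u):
    """Gradient of Formula 9 with respect to (t, s, u), by the quotient rule."""
    d = (1 - t * s * u) ** 2
    return ((s * u - 1) / d, (1 - t) * t * u / d, (1 - t) * t * s / d)

def tune(params, y, step=0.1, tol=1e-4, max_iter=1000):
    """Gradient descent on dist_E for a binary variable (Formula 7).

    Assumption: we stop when |x - y| < tol, which for this example is
    equivalent to the gradient of the distance being close to 0.
    """
    history = [params]
    for _ in range(max_iter):
        x = x_of(*params)
        if abs(x - y) < tol:
            break
        g = grad_x(*params)
        params = tuple(p - step * 4 * (x - y) * gi for p, gi in zip(params, g))
        history.append(params)
    return history

history = tune((0.6, 0.7, 0.4), y=0.43)
t1, t_final = history[1], history[-1]
print(t1, x_of(*t1))
print(t_final, x_of(*t_final))
```

In a real network, `x_of` and `grad_x` would be replaced by propagation and by the local calculations of Section 4.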



4 Determining P(A | e) as a Function of t



By a simple parameter for BN, we shall understand an entry in a table specified 
in BN. 

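We have

$$x = P(a \mid e) = \frac{P(a, e)}{P(e)}. \qquad (13)$$

Then

$$\frac{\partial x}{\partial t} = \frac{1}{P(e)^2}\left(\frac{\partial P(a, e)}{\partial t}\, P(e) - P(a, e)\, \frac{\partial P(e)}{\partial t}\right) = \frac{1}{P(e)}\left(\frac{\partial P(a, e)}{\partial t} - x\, \frac{\partial P(e)}{\partial t}\right). \qquad (14)$$

To determine $\frac{\partial P(a, e)}{\partial t}$ and $\frac{\partial P(e)}{\partial t}$ we exploit work by Russell et al. [7] and
Castillo, Gutierrez and Hadi [1]. Russell et al. show that calculation of partial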
derivatives can be performed locally after one propagation. The derivatives they 
seek are slightly different from the ones in the present paper. Also, the method 
presented here is slightly different and relies on the following result which is only 
implicit in the work by Russell et al. 



Theorem 1 (Castillo, Gutierrez and Hadi, 1996). Let BN be a Bayesian
network. Let a be a state of the variable A, let e be a set of observations, and let
t be a simple parameter in BN. Then P(a, e) as well as P(e) are linear functions
in t.



Proof. The proof exploits that P{a, e) as well as P(e) are marginals of the prod- 
uct of all conditional probability tables attached to BN. As the table containing 
t only occurs once in the product, the result is a polynomial in t of at most first 
degree. 

To illustrate the general principles in calculating the partial derivatives,
consider the following simple example. We have a clique V in a junction tree with
neighbours $W_1$ and $W_2$ (see Figure 2).
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Fig. 2. A simple junction tree. The clique V holds the parameter t and evidence has 
been inserted through the function ev 

Initially, V holds the function f(V) which is a product of conditional
probabilities of the form $P(X \mid parents(X))$. Assume that the table $P(A \mid \Pi)$ contains
a simple parameter t. That is, for some state a and some parent configuration $\pi$
we have $P(a \mid \pi) = t$. So, we have

$$f(V, t) = P(A \mid \Pi)(t)\, g(V), \qquad (15)$$

where g(V) is not a function of the parameter t.

Let the neighbours $W_1$ and $W_2$ have functions attached over the parameters
$s_1$ and $s_2$, and assume too that the evidence e has been entered to the cliques
through the evidence functions $ev$, $ev_1$ and $ev_2$ (see Figure 2).

Now, the parameters have the values $(t^0, s_1^0, s_2^0)$ and we wish to calculate
$\frac{\partial}{\partial t}\bigl(P(e)(t^0, s_1^0, s_2^0)\bigr)$.

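After propagation we have for the clique V

$$P(V, e)(t^0, s_1^0, s_2^0) = f'(V, e)(t^0, s_1^0, s_2^0) = ev\, P(A \mid \Pi)(t^0)\, g(V)\, h_1(S_1)(s_1^0)\, h_2(S_2)(s_2^0), \qquad (16)$$

where $h_i$ are the functions transmitted to V through the separators $S_i$. As

$$P(e) = \sum_V P(V, e), \qquad (17)$$

Formula 16 can be used for a local calculation of $\frac{\partial}{\partial t}\bigl(P(e)(t^0, s_1^0, s_2^0)\bigr)$: We have

$$P(V, e)(t, s_1^0, s_2^0) = \frac{P(A \mid \Pi)(t)\, P(V, e)(t^0, s_1^0, s_2^0)}{P(A \mid \Pi)(t^0)}. \qquad (18)$$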



From Theorem 1 we have that P(e) is a linear function in t. So, it is sufficient
to use Formula 18 and Formula 17 to get another value of P(e)(t), and elementary
arithmetic yields the coefficient to t which then is $\frac{\partial}{\partial t}\bigl(P(e)(t^0, s_1^0, s_2^0)\bigr)$. We
shall not go into details on how the Formulas 13, 17 and 18 can be used for
efficient ways of calculating the derivatives. The point to be made here is that
after one propagation all the partial derivatives can be determined through local
calculations.
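
The use of linearity can be illustrated with a small Python sketch. Since P(e) is linear in a simple parameter, two evaluations determine its coefficient. Here the closed form P(c) = 1 - tsu from the example of Section 3 stands in for the values that Formulas 17 and 18 would deliver; the function names and the second evaluation point are illustrative assumptions.

```python
def p_e(t, s=0.7, u=0.4):
    """P(c) for the example network; stands in for a value obtained locally."""
    return 1 - t * s * u

def linear_coefficients(f, t0, t1):
    """Recover (slope, intercept) of a function known to be linear in t
    from its values at two distinct points."""
    slope = (f(t1) - f(t0)) / (t1 - t0)
    return slope, f(t0) - slope * t0

# t0 = 0.6 is the current value; t1 = 1.0 is an arbitrary second point.
slope, intercept = linear_coefficients(p_e, 0.6, 1.0)
print(slope, intercept)
```

The slope is the partial derivative of P(e) with respect to t at the current parameter values.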
Formula 18 presents a problem if $P(A \mid \Pi)(t^0)$ holds zero-entries not shared
by $P(A \mid \Pi)(t)$. This typically happens when $t^0 = 0$. If the propagation is lazy
(Madsen and Jensen, 1998 [5]), this is not a problem as P{V, e) is represented as a 
set of functions, and Formula 18 consists simply in substituting one function with 
another. If traditional propagation methods are used, you can avoid the zero- 
entry problem at the cost of space by keeping the function g{V) from Formula 15. 

To calculate P(a, e), insert a as virtual evidence and do as above. So, through 
two propagations we can calculate the partial derivatives of P{a \ e). This can 
be reduced to one and a half propagation by collecting to a clique V containing 
A and distributing from V with and without a inserted as virtual evidence. 

5 The Example Revisited 

Using the technique from Section 4, we calculate for t = (0.6, 0.7, 0.4) that
x = 0.48, P(c) = 0.832, and through two propagations we get

$$P(a, c) = -t + 1 = 0s + 0.4 = 0u + 0.4, \qquad (19)$$

and

$$P(c) = -0.28t + 1 = -0.24s + 1 = -0.42u + 1. \qquad (20)$$

That is, grad P(a, c) = (-1, 0, 0) and grad P(c) = (-0.28, -0.24, -0.42).
Using Formula 14, we get

$$\mathrm{grad}\, x = (-1.04, 0.139, 0.242). \qquad (21)$$

We proceed as in Section 3, get an updated t = (0.62, 0.697, 0.395) and
x = 0.458. Next, two propagations provide the new coefficients for P(a, c) and
P(c), and so forth.
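
The step from the two linear forms to grad x is the quotient rule of Section 4, applied componentwise. A minimal Python sketch follows, with an illustrative function name and the numbers of this example as input:

```python
def grad_posterior(grad_joint, grad_evidence, p_joint, p_evidence):
    """Quotient rule for x = P(a, e) / P(e), componentwise:
    (dP(a,e)/dt - x * dP(e)/dt) / P(e)."""
    x = p_joint / p_evidence
    return tuple((gj - x * ge) / p_evidence
                 for gj, ge in zip(grad_joint, grad_evidence))

grad = grad_posterior(
    grad_joint=(-1.0, 0.0, 0.0),          # grad P(a, c)
    grad_evidence=(-0.28, -0.24, -0.42),  # grad P(c)
    p_joint=0.4,                          # P(a, c) at t = 0.6
    p_evidence=0.832,                     # P(c)
)
print(grad)
```

Up to rounding, the result agrees with the gradient obtained from the closed form in Section 3.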

Note 1. As P{a | c) is a function of all parameters, the gradient of the distance 
is only 0 when x = y. This means that the procedure converges to the requested 
situation unless all sets of parameters satisfying x = y are outside their scope (in
the current example, all parameters must be in the unit interval). Although the 
method converges to a requested state, this state is not unique. Therefore, there 
is room for adding further constraints to the requested state. For example, we 
may add some constraints describing how easy it shall be to push the parameters. 
We shall return to this later. 

6 Neural Network Type Training 

In (feed forward) neural networks, we have a fixed set I of input variables (ob- 
served), a fixed set O of output variables (target), and a set of intermediate 
variables with relations parametrized by weights and thresholds. The network 
is trained by a set of pairs (i, o) relating an input configuration to an output 
configuration. The training method consists in gradient descent training with 
respect to $dist_E$.
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Translated to Bayesian networks, the neural network training situation is 
the following: I and O are sets of variables in a Bayesian network BN; T, the
training set, consists of cases. A case has the form (i, o), where i and o are 
configurations over I and O respectively. We shall call T functional if any two 
cases with the same input configuration also have the same output configuration. 
This is the situation for typical neural network classification tasks. On the other 
hand, Bayesian network classification need not be definite, and therefore T need 
not be functional. It is then called stochastic. 

Let $i_1, \ldots, i_r$ be the different input configurations of T, and let $w_1, \ldots, w_r$ be
their frequencies.

If T is functional, we let $o_1, \ldots, o_r$ denote the set of corresponding output
configurations, and let $x_i$ denote $P(o_i \mid i_i)$. When training BN, we wish to
take into account the frequencies of the cases. That is, we require the highest
precision for the most frequent configurations. Hence, we train BN with the
following distance $dist_{wE}$:

$$dist_{wE}(T) = \sum_i w_i (1 - x_i)^2. \qquad (22)$$

The fact that $o_i$ is a configuration over a set of variables rather than a single
variable causes a small problem. $o_i$ can be entered as virtual evidence, but as we
cannot collect to a clique containing O, we have to perform two full propagations.

Now, assume that T is stochastic. Let $\{o_{i1}, \ldots, o_{is}\}$ denote the various output
configurations associated with the input configuration $i_i$, and let $\{y_{i1}, \ldots, y_{is}\}$ be
the frequencies given $i_i$. Let $x_{ij}$ denote $P(o_{ij} \mid i_i)$. We require the training to
approach a situation where $x_{ij} = y_{ij}$, so we train with the following distance:

$$dist_{wE}(T) = \sum_{i,j} w_i (x_{ij} - y_{ij})^2. \qquad (23)$$

Some entries in conditional probability tables are zero. There are in principle 
two types of zeros: “This is impossible” and “it may not be impossible, but I 
have never seen it” . 

In neural network type training with a stochastic training set, some of the
frequencies yij may be zero. This may be due to a poor training set, and we 
should allow these parameters to be modified. As described in Section 4, zero- 
parameters can be handled. 

Compared to neural networks, Bayesian networks have a very large set of 
simple parameters, and if they are all open for modification, the updating task 
will be very heavy and convergence may be slow (see for example Russell et 
al. [7]). Therefore, it is a good idea to reduce the number of parameters. This 
can for example be done through divorcing or independence of causal influence 
(see for example Jensen, 1996 [2]). 

A standard technique like “noisy or” will also reduce the number of parameters.
However, in “noisy or” the parameters are not simple.

Take for example the conditional table in Table 2. 
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Table 2. P{C = c | A, B) as a noisy or with the parameters t and s 



B\A | a      | ¬a
b   | 1 - ts | 1 - s
¬b  | 1 - t  | 0

The three simple parameters $(x_1, x_2, x_3)$ in the table are functions of $(t, s)$,
and instead of the partial derivatives with respect to $(x_1, x_2, x_3)$, we need the
partial derivative of a composed function with respect to $(t, s)$. From elementary
calculus we have

$$\frac{\partial f}{\partial t} = \sum_i \frac{\partial f}{\partial x_i} \cdot \frac{\partial x_i}{\partial t} = -s\, \frac{\partial f}{\partial x_1} - \frac{\partial f}{\partial x_2}, \qquad (24)$$

and the requested gradient is easy to calculate from the gradient with respect
to the simple parameters.
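
A minimal Python sketch of this chain rule for the noisy-or table is given below. It assumes the ordering $(x_1, x_2, x_3) = (1 - ts, 1 - t, 1 - s)$ of the simple parameters; the function name and the composed function used in the usage lines are illustrative assumptions.

```python
def noisy_or_gradient(grad_simple, t, s):
    """Map a gradient with respect to the simple parameters
    (x1, x2, x3) = (1 - t*s, 1 - t, 1 - s) to a gradient with respect to (t, s)."""
    d1, d2, d3 = grad_simple
    d_t = -s * d1 - d2  # dx1/dt = -s, dx2/dt = -1, dx3/dt = 0
    d_s = -t * d1 - d3  # dx1/ds = -t, dx2/ds = 0,  dx3/ds = -1
    return d_t, d_s

t, s = 0.2, 0.5
x1, x2, x3 = 1 - t * s, 1 - t, 1 - s
# Illustrative composed function f(x1, x2, x3) = x1 * x2 + x3,
# whose gradient with respect to the simple parameters is (x2, x1, 1).
grad_simple = (x2, x1, 1.0)
d_t, d_s = noisy_or_gradient(grad_simple, t, s)
print(d_t, d_s)
```

The same pattern applies whenever several simple parameters are functions of a smaller set of meta-parameters.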

This technique can also be used in situations where simple parameters covary. 
In temporal networks, for example, some simple parameters are identical or it 
may happen that an effect is damped by a constant factor from time slice to 
time slice. 

7 Resistance and Damping 

So far, we have treated methods which change the parameters in the direction of
maximal change of the distance, and the only choice we have is whether or not to
declare a parameter modifiable. There may be reasons why we would prefer to
change some parameters rather than others. Some parameters may for example
be based on a large number of known cases, others may be rather ignorant
guesses.

Therefore, we may attach a positive real r to each parameter expressing its 
resistance to change. 

Assume in the example in Section 3 that you are equally certain of the
parameters s and u, while you are much more certain of the parameter t. So, we
could attach the resistance vector (4, 1, 1) to the parameters. Instead of the
displacement (0.02, -0.003, -0.005), we divide each component by its resistance,
and we get the displacement (0.005, -0.003, -0.005). As all components in the
gradient are multiplied by a positive factor, the direction of the displacement is
still a direction of decrease of the distance. Iterating this process seven times,
we end with the parameter set (0.638, 0.679, 0.364) and P(a | c) = 0.4298. Note
that the procedure stops in another equilibrium state with a smaller change of
the parameter t. In general, when tuning is concerned, the large number of
parameters will usually provide a large number of possible states of equilibrium,
and resistance is a way of guiding convergence in a preferred direction.
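
The effect of resistance on a single step can be sketched as follows. The function name is an illustrative assumption; the numbers are the displacement and resistance vector of the example.

```python
def resisted_step(displacement, resistance):
    """Divide each component of a displacement by its (positive) resistance."""
    return tuple(d / r for d, r in zip(displacement, resistance))

plain = (0.02, -0.003, -0.005)
step = resisted_step(plain, resistance=(4, 1, 1))
print(step)

# The resisted step still has a positive inner product with the plain step,
# so it remains a direction of decrease of the distance.
inner = sum(a * b for a, b in zip(step, plain))
print(inner > 0)
```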
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There are ways of quantifying resistance. You may for example express your 
uncertainty of a parameter through a virtual sample size: “The parameter t is
the result of n cases with parent configuration $\pi$ out of which m had A in state
a”. Now, assume that you observe a new case with parent configuration $\pi$ and
A in state a. Then the parameter should change from $\frac{m}{n}$ to $\frac{m+1}{n+1}$. This difference
is $\frac{1-t}{n+1}$. If the new case does not have A in state a, the difference is $-\frac{t}{n+1}$. That
is, for non-extreme parameter values, the virtual sample size is a good measure
of resistance.
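
The two differences can be checked with exact rational arithmetic. The following Python sketch uses illustrative counts m = 7 and n = 20, which are assumptions and not taken from the text.

```python
from fractions import Fraction

def updated_estimate(m, n, in_state_a):
    """Estimate after one more case with the same parent configuration."""
    return Fraction(m + (1 if in_state_a else 0), n + 1)

m, n = 7, 20  # illustrative virtual counts
t = Fraction(m, n)
up = updated_estimate(m, n, True) - t     # new case has A in state a
down = updated_estimate(m, n, False) - t  # new case has A in another state
print(up, down)
```

Both changes shrink in proportion to 1/(n + 1), which is why a large virtual sample size acts as a high resistance.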

The terms (1 — t) and t in the expressions above highlight another point: 
Usually, we would like the speed of change to be damped when a simple param- 
eter approaches the extreme values 1 and 0. To achieve this, we can introduce 
damping factors which are multiplied on the components of the gradient. For 
the t-component, a damping factor could be t(1 - t). Note, however, that if you
start off having a parameter with an extreme value, then this damping factor 
will prevent it from ever leaving it. 

Frequently, the uncertainty of a parameter is expressed as an interval [x, y].
To exploit the techniques in this paper, you can translate the interval to a density
function v(t) for the parameter t, and use v(t) as a damping factor. If the interval
should be taken strict, a damping density function could be (y - t)(t - x), and
the initial value can be $\frac{x+y}{2}$. This damping factor yields the lowest resistance
in the middle of the interval. Another way of translating an interval [x, y] is to
interpret it as: “I expect the value of t to be somewhere in the middle of the
interval, and I am 90% sure that the value is in the interval”. That is, you have
a distribution of t with mean close to $\frac{x+y}{2}$ and 90% of the density mass inside
[x, y].

Take for example the interval [0.3, 0.4] for the state a of the binary variable
A. We interpret the interval as above, and assume that the distribution is the
result of n samples of A out of which m were in state a. The parameter t is then
Beta-distributed with mean $\nu = \frac{m}{n}$ and variance $\sigma^2 = \frac{m(n-m)}{n^2(n+1)}$. We seek values
for m and n such that $\nu = 0.35$ and $\sigma = 0.05$, and we get m = 31.5 and n = 90.
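
The values of m and n follow from solving the mean and variance equations of the Beta distribution. A minimal Python sketch, with an illustrative function name, assuming the variance is written as mean(1 - mean)/(n + 1):

```python
def beta_counts(mean, std):
    """Solve mean = m / n and std**2 = mean * (1 - mean) / (n + 1) for (m, n)."""
    n = mean * (1 - mean) / std ** 2 - 1
    return mean * n, n

m, n = beta_counts(0.35, 0.05)
print(m, n)
```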

Note 2. The technique of introducing second order uncertainty into gradient 
training is different from statistical techniques. In statistical techniques, you start 
off with a prior distribution of the parameters, and when cases are absorbed, the 
distribution is modified. Using gradient training, you change the value of the
parameter, but the prior resistance or density function used for damping is not 
changed accordingly. 

8 Adaptation 

When a system is at work, you repeatedly get new cases, and you would like to 
learn from each case. This is called adaptation. First of all, you may wish to adapt 
your system to a particular environment. Also, some conditional probabilities 
may float, and you may wish the system to follow. A way to perform adaptation 
is sequential updating. 




Gradient Descent Training of Bayesian Networks 199 



Statistical methods for sequential updating are based on the method of
fractional updating (Spiegelhalter and Lauritzen, 1990 [8] and Olesen et al., 1992 [6]),
which essentially is virtual counting: You start off with a virtual sample size de- 
scribing the uncertainty of a parameter, and when a new case arrives, the sample 
sizes are counted up. Afterwards, the new counts are modified in order to get 
correct means and (means of) variances. If the configuration of a variable and its 
parents is not determined completely by a case, the Bayesian network is used to 
estimate the probabilities for the possible configurations, and these probabilities 
are used for the upcounting. To take care of floating parameters, fading away 
can be used (Olesen et al. [6]). Fading away is a way of letting the influence
from old cases decrease: Before the counts for a parameter are modified, they 
are multiplied by a factor between 0 and 1. 

Let t be a set of parameters for the Bayesian network BN. When a new case 
arrives, we would like to change the parameters in order to adapt to the case. 
Let e be a configuration of variables observed. In BN we have x = P(e), and
as e has been observed, the probability for this particular set of observations
should be increased (unless it is already 1). As described earlier, grad x can be
computed through one propagation, and a displacement of t in the direction of
grad x will result in the requested increase of probability. This means that when
the case is propagated for calculating posterior probabilities, the adaptation can 
take place simultaneously. 

By using the gradient of the distance rather than gradP(e) directly, we en- 
sure that the step size is damped when we approach the requested probability. 
Furthermore, the initial virtual sample sizes used in statistical sequential updat- 
ing can be used as resistance in gradient sequential updating. 
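
One adaptation step can be sketched on the network of Section 3, taking the observed case to be e = c so that x = P(c) = 1 - tsu. The choice of distance $(1 - x)^2$, the step size and the function names are assumptions made for illustration; the text does not fix them.

```python
def p_c(t, s, u):
    """P(c) for the example network of Section 3."""
    return 1 - t * s * u

def grad_p_c(t, s, u):
    """Gradient of P(c) with respect to (t, s, u)."""
    return (-s * u, -t * u, -t * s)

def adapt(params, step=0.1):
    """One gradient step on the assumed distance (1 - x)^2 with x = P(e).

    The step moves along grad x and is damped by the factor (1 - x),
    so it vanishes as P(e) approaches 1.
    """
    x = p_c(*params)
    g = grad_p_c(*params)
    return tuple(p + step * 2 * (1 - x) * gi for p, gi in zip(params, g))

old = (0.6, 0.7, 0.4)
new = adapt(old)
print(p_c(*old), p_c(*new))
```

Dividing each component of the step by a resistance derived from a virtual sample size would combine this sketch with the technique of Section 7.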

Gradient sequential updating is in many ways similar to the fading away type 
of statistical sequential updating, and they are also quite similar with respect 
to computational complexity. However, there are some differences. The gradient
method can handle covarying parameters contrary to the statistical method. 
Furthermore, the gradient method can handle zero-parameters which is impos- 
sible for the statistical method unless a case gives a definite count for a zero- 
parameter. In the statistical method, on the other hand, damping/resistance is 
updated with the cases, while it is invariant during gradient updating. 

9 Conclusions 

We have presented various techniques for exploiting gradient descent training on
Bayesian networks. The tasks have been tuning, neural network type training
and adaptation, and it was demonstrated that fewer than two propagations per
case are sufficient for performing the tasks. Adaptation can be done with
one propagation, performed together with the propagation carried out for
the case itself.

For Bayesian networks, the training situation is in general such that we have an
initial network with initial conditional probabilities attached. Some of the simple 
parameters in the network have attached second order uncertainty described as 
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an initial value, a range and a resistance. Also, some parameters may covary, 
and this is described through functions over meta-parameters. Furthermore, we 
may have a training set T of cases. The variables observed in the cases may vary 
from case to case, and now we will use T for training. 

Combinations of the techniques described here can be used for this task, and 
experiments are needed on real world as well as toy examples to gain insight into 
the performance of the various techniques. 
Abstract. We compare two approaches to the meaning of a free variable
x in an open default $\frac{\alpha(x) : M\beta_1(x), \ldots, M\beta_m(x)}{\gamma(x)}$. The first treats x as
a metavariable for the ground terms of the underlying theory, whereas
the second treats it as a “name” of arbitrary elements of the theory
universe. We show that, for normal default theories, under the domain 
closure assumption, the two approaches are equivalent. In the general 
case, the approaches are equivalent in the presence of both the domain 
closure assumption and the unique name assumption. 



1 Introduction 

One of the widely used non-monotonic formalisms is Reiter’s default logic ([12]).
This logic deals with rules of inference called defaults, which are expressions of
the form

$$\frac{\alpha(x) : M\beta_1(x), \ldots, M\beta_m(x)}{\gamma(x)}, \qquad (1)$$

where $\alpha(x)$, $\beta_1(x), \ldots, \beta_m(x)$, m > 0, and $\gamma(x)$ are formulas of first-order
logic whose free variables are among $x = x_1, \ldots, x_n$. A default is closed if
none of $\alpha$, $\beta_1, \ldots, \beta_m$, and $\gamma$ contains a free variable. Otherwise it is open.
The formula $\alpha(x)$ is called the prerequisite of the default rule, the formulas
$\beta_1(x), \ldots, \beta_m(x)$ are called the justifications, and the formula $\gamma(x)$ is called the
conclusion. Roughly speaking, the intuitive meaning of a default is as follows.
For every n-tuple of objects $t = t_1, \ldots, t_n$, if $\alpha(t)$ is believed, and the $\beta_i(t)$ are
consistent with one’s beliefs, then one is permitted to deduce $\gamma(t)$ and add it to
the “belief set.” In spite of a very clear intuition, formalizing the meaning of an
open default is an elusive task. In this paper we compare two approaches to the
meaning of the free variables x in the default (1).

The first approach, that to some extent can be attributed to Poole ([11]),
treats x as metavariables for the ground terms of the underlying theory.¹ In
contrast, the second approach, due to Lifschitz ([9]), treats the free variables
in defaults as object variables, rather than metavariables for the ground terms

¹ In fact, Poole in [11] considers only normal defaults without prerequisites, i.e., defaults
of the form $\frac{: M\beta(x)}{\beta(x)}$.

A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 201-207, 1999. 
of the underlying theory. It appears that the two approaches significantly differ
from each other. Moreover, in many cases the “ground substitution” approach
behaves in a counterintuitive fashion, see [9], [5], and Section 5. However, in
closed domains, the “ground substitution” approach is relatively easy from a
computational point of view, whereas the “object variable” definition involves a
complicated semantical construction, see Definition 5 in Section 3. Thus,
identifying a class of open default theories for which the definitions are equivalent is
of both practical and theoretical interest. We show that, for normal default
theories, under the domain closure assumption, the above two approaches are 
equivalent. In the general case, the approaches are equivalent in the presence of 
both the domain closure assumption and the unique name assumption. 

The paper is organized as follows. In the next section we recall the Herbrand
semantics of first-order logic and in Section 3 we recall the definitions
of the “ground substitution” and the “all object” extensions for open default
theories. In Section 4 we explain why the domain closure assumption is required
when one deals with the “ground substitution” approach to extensions for open
default theories. In Section 5 we show the difference between the “ground
substitution” and the “all object” definitions of extensions for open default theories,
substantiate the unique name assumption requirement for the equivalence of the
definitions, and state two equivalence theorems.

2 Herbrand Semantics of First-Order Logic 

The formal definition of the “all object” approach to a free variable in an open
default is based on the Herbrand semantics of first-order logic. We shall need
the following notation and definitions.

The language of the underlying first-order logic will be denoted by $\mathcal{L}$. Let b
be a set that contains no symbols of $\mathcal{L}$, and let $\mathcal{L}_b$ denote the language obtained
from $\mathcal{L}$ by augmenting its set of constants with all elements of b. The set of all
closed terms of the language $\mathcal{L}_b$, denoted $T_{\mathcal{L}_b}$, is called the Herbrand universe
of $\mathcal{L}_b$. A Herbrand b-interpretation is a set of ground (closed) atomic formulas
of $\mathcal{L}_b$.

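Let w be a Herbrand b-interpretation and let $\varphi$ be a sentence (closed formula)
of $\mathcal{L}_b$. We say that w satisfies $\varphi$, denoted $w \models \varphi$, if the following holds.

- If $\varphi$ is an atomic formula, then $w \models \varphi$ if and only if $\varphi \in w$;

- $w \models \varphi \supset \psi$ if and only if $w \not\models \varphi$ or $w \models \psi$;

- $w \models \neg\varphi$ if and only if $w \not\models \varphi$; and

- $w \models \exists x \varphi(x)$ if and only if for some $t \in T_{\mathcal{L}_b}$, $w \models \varphi(t)$.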
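
The four clauses translate directly into a recursive check. The following Python sketch assumes a finite Herbrand universe (a language without function symbols) and a hypothetical encoding of sentences as nested tuples; both are illustrative assumptions and not part of the definition.

```python
def satisfies(w, phi, universe):
    """Decide whether the Herbrand interpretation w satisfies the sentence phi.

    w        -- set of ground atoms, e.g. {("P", "a")}
    phi      -- ("atom", atom), ("implies", phi1, phi2), ("not", phi1), or
                ("exists", body) where body maps a ground term to a sentence
    universe -- finite list of ground terms (assumed finite for this sketch)
    """
    kind = phi[0]
    if kind == "atom":
        return phi[1] in w
    if kind == "implies":
        return (not satisfies(w, phi[1], universe)) or satisfies(w, phi[2], universe)
    if kind == "not":
        return not satisfies(w, phi[1], universe)
    if kind == "exists":
        return any(satisfies(w, phi[1](t), universe) for t in universe)
    raise ValueError(f"unknown connective: {kind}")

universe = ["a", "b"]
w = {("P", "a")}
exists_p = ("exists", lambda t: ("atom", ("P", t)))
print(satisfies(w, exists_p, universe))
print(satisfies(w, ("not", ("atom", ("P", "b"))), universe))
```

With function symbols the Herbrand universe is infinite, and the existential clause can no longer be decided by enumeration.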

For a Herbrand b-interpretation w we define the $\mathcal{L}$-theory of w, denoted
$Th_{\mathcal{L}}(w)$, as the set of all sentences of $\mathcal{L}$ satisfied by w. Let X be a set of
sentences of $\mathcal{L}$. We say that w is a (Herbrand) b-model of X, if $X \subseteq Th_{\mathcal{L}}(w)$.
Finally, for a class of Herbrand interpretations W we define the $\mathcal{L}$-theory of W,
denoted $Th_{\mathcal{L}}(W)$, as the set of all sentences of $\mathcal{L}$ satisfied by all elements of W.
That is, $Th_{\mathcal{L}}(W) = \bigcap_{w \in W} Th_{\mathcal{L}}(w)$.
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3 Default Theories and Their Extensions 

A default theory is a pair (D, A), where D is a set of defaults and A is a set of
first-order sentences (axioms). A default theory is closed, if all its defaults are
closed. Otherwise it is open.

Definition 1. ([12]) Let (D, A) be a closed default theory. For any set of
sentences S let $\Gamma_{(D,A)}(S)$ be the smallest deductively closed set of sentences B
(beliefs) that contains A and satisfies the following condition.

If $\frac{\alpha : M\beta_1, \ldots, M\beta_m}{\gamma} \in D$, $\alpha \in B$, and $\neg\beta_1, \ldots, \neg\beta_m \notin S$, then $\gamma \in B$.

A set of sentences E is an extension for (D, A) if $\Gamma_{(D,A)}(E) = E$, i.e., if E is
a fixed point of the operator $\Gamma_{(D,A)}$.

Next we recall the “ground substitution” definition of extensions for open
default theories ([11]). This definition treats an open default as an abbreviation
of all its closed instances. That is, a free variable in a default rule is viewed as
a metavariable ranging over the set $T_{\mathcal{L}}$ of all ground terms of the language $\mathcal{L}$.

Definition 2. A ground (or closed) instance of an open default
$\delta(x) = \frac{\alpha(x) : M\beta_1(x), \ldots, M\beta_m(x)}{\gamma(x)}$ is a default $\delta(t) = \frac{\alpha(t) : M\beta_1(t), \ldots, M\beta_m(t)}{\gamma(t)}$,
where $t = t_1, \ldots, t_n$ is a tuple of ground terms of $\mathcal{L}$. For an open default $\delta$, the
set of all ground instances of $\delta$ is denoted by $\delta_{\mathcal{L}}$, and for a set of defaults D,
$D_{\mathcal{L}} = \bigcup_{\delta \in D} \delta_{\mathcal{L}}$ is the set of all ground instances of all defaults of D.

Poole in [11] treats an open default theory $(D, A)$ as the closed default theory
$(D_{\mathcal{L}}, A)$.

Definition 3. Let $(D, A)$ be an open default theory. Extensions for the closed
default theory $(D_{\mathcal{L}}, A)$ are called ground substitution extensions for $(D, A)$.
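The grounding step of Definitions 2 and 3 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: formulas are kept as plain strings with `{x}`-style placeholders for free variables, the names `Default` and `ground_instances` are hypothetical, and the set of ground terms is taken to be finite (which holds, for example, when the language has finitely many constants and no function symbols).

```python
from itertools import product
from typing import NamedTuple


class Default(NamedTuple):
    """A default alpha : M beta_1, ..., M beta_m / gamma (hypothetical encoding).

    Formulas are strings; free variables appear as {name} placeholders.
    """
    prerequisite: str
    justifications: tuple
    consequent: str
    variables: tuple


def ground_instances(default, ground_terms):
    """Return all closed instances of `default` over a finite list of ground terms."""
    instances = []
    # One instance per assignment of ground terms to the free variables.
    for values in product(ground_terms, repeat=len(default.variables)):
        binding = dict(zip(default.variables, values))
        instances.append(Default(
            prerequisite=default.prerequisite.format(**binding),
            justifications=tuple(j.format(**binding) for j in default.justifications),
            consequent=default.consequent.format(**binding),
            variables=(),
        ))
    return instances


# Example: the open default  : M ~P(x) / ~P(x)  over two constants.
d = Default("", ("~P({x})",), "~P({x})", ("x",))
for inst in ground_instances(d, ["a1", "a2"]):
    print(inst.justifications, "/", inst.consequent)
```

A closed default has no free variables, so the sketch returns it unchanged as its single instance. Computing extensions of the resulting closed theory is a separate step that this sketch does not attempt.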

Finally we recall the semantical definition of extensions for open default 
theories [5]. We start with an equivalent semantical definition of extensions for 
closed default theories (Definition 4) that motivates the definition of extensions 
for open default theories (Definition 5). 

Definition 4. ([4]) Let $(D, A)$ be a closed default theory. For any class of interpretations
$W$ let $\Sigma_{(D,A)}(W)$ be the largest class $V$ of models of $A$ that satisfies
the following condition.

If $\frac{\alpha : M\beta_1, \ldots, M\beta_m}{\gamma} \in D$, $\alpha \in Th_{\mathcal{L}}(V)$, and $\neg\beta_1, \ldots, \neg\beta_m \notin Th_{\mathcal{L}}(W)$,
then $\gamma \in Th_{\mathcal{L}}(V)$.

It is known from [4] (see also [9, Proposition 2]) that the definition of extensions
as the theories of the fixed points of the operator $\Sigma$ is equivalent to Reiter's
original definition (Definition 1). That is, a set of sentences $E$ is an extension for
a closed default theory $(D, A)$ if and only if $E = Th_{\mathcal{L}}(W)$ for some fixed point
$W$ of $\Sigma_{(D,A)}$.
Now, departing from Definition 4 and following [9] and [5], we define extensions
for open default theories. We start with the intuition lying behind our
definition.

There are two types of objects in the domain of a default theory. One type
consists of the fixed built-in objects which belong to $T_{\mathcal{L}}$ and must be present
in any Herbrand interpretation, and the other type consists of implicitly defined
unknown objects which may vary from one Herbrand interpretation to another, e.g.,
objects introduced by existentially quantified formulas. These objects generate
other unknown objects by means of the function symbols of $\mathcal{L}$. Thus, it seems
natural to assume that the theory domain is a Herbrand universe of the original
language augmented with a set of new (unknown) objects, cf. [10, Chapter 1, §3].

The following definition of extensions for open default theories is a relativization
of Definition 4 to Herbrand $b$-interpretations with an infinite set of
new constant symbols $b$.^ The reason for passing to a semantical definition is
that, in general, it is impossible to describe a Herbrand universe by means of a
proof theory. The only exception is the case of closed domains, see Section 4.

Definition 5. ([5]) Let $b$ be an infinite set of new constant symbols and let
$(D, A)$ be a default theory. For any set of Herbrand $b$-interpretations $W$ let
$\Sigma^b_{(D,A)}(W)$ be the largest set $V$ of Herbrand $b$-models of $A$ that satisfies the
following condition.

For any $\frac{\alpha(x) : M\beta_1(x), \ldots, M\beta_m(x)}{\gamma(x)} \in D$ and any tuple $t$ of elements of $T_{\mathcal{L}_b}$,
if $\alpha(t) \in Th_{\mathcal{L}_b}(V)$ and $\neg\beta_1(t), \ldots, \neg\beta_m(t) \notin Th_{\mathcal{L}_b}(W)$, then $\gamma(t) \in Th_{\mathcal{L}_b}(V)$.

A set of sentences $E$ is called a $b$-extension for $(D, A)$ if $E = Th_{\mathcal{L}}(W)$
for some fixed point $W$ of $\Sigma^b_{(D,A)}$. That is, Definition 5 treats a free variable
in a default as an “object variable.” It is known from [5, Theorem 42] that for
closed default theories Definition 5 is equivalent to Reiter's original definition
(Definition 1).

4 The Domain Closure Condition 

In this section we explain why the domain closure assumption is required when
one uses the ground substitution approach to extensions for open default theories.
The requirement is illustrated by the following example.

Example 1. ([9]) Consider an open default theory $(\{\frac{: M\neg P(x)}{\neg P(x)}\}, \emptyset)$. The intuitive
meaning of the default $\frac{: M\neg P(x)}{\neg P(x)}$ is that $P$ is false whenever possible.
Thus, in absence of any other information about $P$, we should conclude that it is
false everywhere. That is, $\forall x \neg P(x)$ should be a consequence of $(\{\frac{: M\neg P(x)}{\neg P(x)}\}, \emptyset)$.

^ The requirement that the set of new constants $b$ be infinite is needed to avoid reflection
of the domain cardinality in extensions, see [f] for examples.




Open Default Theories over Closed Domains 






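Indeed, for each $b$, $(\{\frac{: M\neg P(x)}{\neg P(x)}\}, \emptyset)$ has a unique $b$-extension $Th(\{\forall x \neg P(x)\})$.

On the other hand, $(\{\frac{: M\neg P(x)}{\neg P(x)}\}, \emptyset)$ has a unique ground substitution extension
$Th(\{\neg P(t) : t \in T_{\mathcal{L}}\})$. Obviously, $\forall x \neg P(x) \notin Th(\{\neg P(t) : t \in T_{\mathcal{L}}\})$.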



In view of Example 1, dealing with ground substitution extensions is appropriate
only if the set of formulas

$$\{\varphi(t)\}_{t \in T_{\mathcal{L}}} \cup \{\neg\forall x \varphi(x)\} \qquad (2)$$

is inconsistent in all extensions for the default theory under consideration. In
other words, we must assure that the rule of inference $\frac{\{\varphi(t)\}_{t \in T_{\mathcal{L}}}}{\forall x \varphi(x)}$, called the
Carnap rule, cf. [6, Section 6], is admissible in all extensions. Expressing the
Carnap rule by defaults cannot be done in the framework of “classical” default
logic, because the set of axioms (2) would “suppress” any set of defaults supposed
to enforce the rule.

It seems that the only (reasonably general) case in which the set of formulas
(2) is inconsistent is when, for some $u_1, \ldots, u_m \in T_{\mathcal{L}}$, the set of axioms $A$ of an
open default theory $(D, A)$ implies the formula

$$\forall x \bigvee_{i=1}^{m} x = u_i,$$

called the domain closure assumption. In particular, it follows from the domain closure
assumption that there are at most $m$ pairwise distinct theory objects.

In the rest of this paper we shall answer the question of when, for an open default
theory $(D, A)$ such that $A \vdash \forall x \bigvee_{i=1}^{m} x = u_i$, ground substitution extensions
coincide with $b$-extensions.



5 Comparing b-extensions with Ground Substitution
Extensions



We start this section with examples which show that the ground substitution 
approach to free variables in open defaults is, sometimes, counterintuitive. 



Example 2. Consider an open default theory $(D, A)$, where $D$ consists of the
default $\frac{: M\neg P(x), M\neg Q(x), MS}{S}$, and $A$ consists of four axioms: $P(a_1)$, $Q(a_2)$,
$\exists x R(x)$, and the domain closure assumption $\forall x (x = a_1 \vee x = a_2)$. This default
theory has a unique ground substitution extension $Th(A)$. However, for the
object implicitly introduced by the axiom $\exists x R(x)$, both $\neg P$ and $\neg Q$ are possible
(not simultaneously, of course). Therefore, $S$ must be a consequence of $(D, A)$.
Note that $Th(A \cup \{S\})$ is a unique $b$-extension for $(D, A)$ (for each $b$).
An obvious reason for which adding a new object, call it $c$, say,^ results in
increasing the ground substitution extension $Th(A)$ for $(D, A)$ with $S$ is that
both $c = a_1$ and $c = a_2$ are possible with respect to $(D, A)$, cf. [8]. Situations
in which both $c = a_1$ and $c = a_2$ are possible would disappear if we add the
default $\frac{: M x \neq y}{x \neq y}$, called the unique name assumption. Indeed, the default theory
$(D \cup \{\frac{: M x \neq y}{x \neq y}\}, A)$, where $D$ and $A$ are as in Example 2, has a unique
$b$-extension $Th(A)$ (for each $b$). Note that $Th(A)$ is a unique ground substitution
extension for the default theory $(D \cup \{\frac{: M x \neq y}{x \neq y}\}, A)$ as well.

We remind the reader that our purpose is to identify default theories for which
ground substitution extensions coincide with $b$-extensions. Whereas Example 2
overrules semi-normal default theories, i.e., default theories where the conclusion
of a default coincides with one of the justifications, Example 3 below shows that
restriction to prerequisite-free defaults with only one justification is not sufficient
either.



Example 3. Consider an open default theory $(D', A \cup \{\neg S(a_1) \vee \neg S(a_2)\})$, where
D' consists of two defaults ' . MS[x)^ — Q{x) ^ ^ 

Example 2. This default theory has a unique ground substitution extension
$Th(A \cup \{S(a_2)\})$ and a unique $b$-extension $Th(A \cup \{S(a_2), T\})$ (for each $b$).

Note that the default theory $(D' \cup \{\frac{: M x \neq y}{x \neq y}\}, A)$ has a unique $b$-extension
$Th(A \cup \{S(a_2)\})$ (for each $b$).

At this stage one can still hope that ground substitution extensions for $(D, A)$
coincide with $b$-extensions for $(D \cup \{\frac{: M x \neq y}{x \neq y}\}, A)$. However, this is not true in
view of Example 4 below.



Example 4. Consider an open default theory $(D, A')$, where $D$ is as in Example
2 and $A'$ is obtained from $A$ in Example 2 by replacing the axiom $\exists x R(x)$
with $R(c)$. The above default theory has a unique ground substitution extension
$Th(A' \cup \{S\})$, whereas the default theory $(D \cup \{\frac{: M x \neq y}{x \neq y}\}, A')$ has a unique
$b$-extension $Th(A')$ (for each $b$).



Two classes of default theories for which ground substitution extensions coincide
with $b$-extensions are provided by Theorems 6 and 8 below.



Theorem 6. Let $(D, A)$ be an open default theory such that $\frac{: M x \neq y}{x \neq y} \in D$ and
for some $u_1, \ldots, u_m \in T_{\mathcal{L}}$, $A \vdash \forall x \bigvee_{i=1}^{m} x = u_i$. Then a set of formulas $E$ is
a ground substitution extension for $(D, A)$ if and only if for some $b$, $E$ is a
$b$-extension for $(D, A)$ if and only if for all $b$, $E$ is a $b$-extension for $(D, A)$.^



^ This is a kind of a Skolemization argument, cf. [f] and [12]. 

^ Cf. [6, Theorem 8.2], where $u_1, \ldots, u_m$ are required to be pairwise different.
Theorem 8 is motivated by Examples 2 and 3. To state the theorem we need
one more definition.

Definition 7. Defaults of the form $\frac{\alpha(x) : M\beta(x)}{\beta(x)}$ are called normal. A default
theory $(D, A)$ is normal, if all defaults in $D$ are normal.

Theorem 8. Let $(D, A)$ be a normal open default theory such that for some
$u_1, \ldots, u_m \in T_{\mathcal{L}}$, $A \vdash \forall x \bigvee_{i=1}^{m} x = u_i$. Then a set of formulas $E$ is a ground
substitution extension for $(D, A)$ if and only if for some $b$, $E$ is a $b$-extension
for $(D, A)$ if and only if for all $b$, $E$ is a $b$-extension for $(D, A)$.
Abstract. Shopbots are agents that search the Internet for information
pertaining to the price and quality of goods or services. With the
advent of shopbots, a dramatic reduction in search costs is imminent, 
which promises (or threatens) to radically alter market behavior. This 
research includes the proposal and theoretical analysis of a simple eco- 
nomic model which is intended to capture some of the essence of shop- 
bots, and attempts to shed light on their potential impact on markets. 
Moreover, experimental simulations of an economy of software agents are 
described, which are designed to model the dynamic interaction of elec- 
tronic buyers, sellers, and shopbots. This study forms part of a larger 
research program that aims to provide new insights on the impact of 
agent and information technology on the nascent information economy. 



1 Introduction 

Shopbots, agents that automatically search the Internet for goods and services 
on behalf of consumers, herald a future in which autonomous agents become an 
essential component of nearly every facet of electronic commerce [3,8, 12,5]. In 
response to a consumer’s expressed interest in a specified item, a typical shopbot 
can query several dozen web sites, and then collate and sort the available information
for the user, all within seconds. For example, www.shopper.com claims
to compare 1,000,000 prices on 100,000 computer-oriented products! In addition, 
www.acses.com compares the prices and expected delivery times of books of- 
fered for sale on-line, while www.jango.com and webmarket.junglee.com offer 
everything from apparel to gourmet groceries. Shopbots can out-perform and 
out-inform even the most patient, determined consumers, for whom it would 
take hours to obtain far less coverage of available goods and services. 

Shopbots deliver on one of the great promises of electronic commerce and 
the Internet: a radical reduction in the cost of obtaining and distributing infor- 
mation. It is generally recognized that freer flow of information will profoundly 
affect market efficiency, as economic friction will be reduced significantly [1,6,9,4].
Transportation costs, menu costs (the costs to firms of evaluating, updating,
and advertising prices), and search costs (the costs to consumers of seeking
out optimal price and quality) will all decrease, as a consequence of the digital
nature of information as well as the presence of autonomous agents that find,

A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 208-220, 1999. 
process, collate, and disseminate that information at little cost. What are the
implications of the widespread use of shopbots? Specifically, do shopbots have 
the potential to increase social welfare? If so, how can shopbots adequately price 
their services so as to provide consumers with incentives to subscribe, while 
retaining profitability? More generally, what is the expected impact of agent 
technology on the nascent information economy? 

Previous work in economics on the impact of search costs on equilibrium 
prices was oriented towards explaining the phenomenon of price dispersion in 
social economies; see, for example, [11,13,2]. In such work, an attempt is made 
to approximate human behavior with mathematical functions or algorithms, and 
under the relevant assumptions, collective behavior and equilibria are studied. 
In contrast with previous intentions, our mission is to investigate the possible 
dynamics of the future information economy in which software agents, rather 
than human constituents, will play the key role. Consequently, we take mathe- 
matical functions and algorithms a good deal more seriously, by regarding them 
as precise specifications of the behavior of economic players. In this paper, we 
focus on the likely effect that one particular specification of a class of agents, 
namely shopbots, will have on electronic markets. From this study, we hope to 
gain insights into the design of adaptive algorithms for economically-motivated,
computational agents which successfully maximize utility. 

This paper is organized as follows. The next section presents our model of a 
simple market in which shopbots provide price information, which is analyzed 
from a game-theoretic point of view in Section 3. In Section 4, we consider the 
dynamics of interaction among software agents designed to model electronic con- 
sumers and producers; moreover, we investigate the effect of non-linear search 
costs (Section 4.1) and irrational consumers (Section 4.2) via experimental sim- 
ulations. Finally, Section 5 presents our conclusions and ideas for future work. 

2 Model 

We consider an economy in which there is a single commodity that is offered
for sale by $S$ sellers and of interest to $B$ buyers. Periodically, at a rate $\rho_b$, a
buyer $b$ attempts to purchase a unit of the commodity. Each attempted purchase
proceeds as follows. First, buyer $b$ conducts a search of fixed sample size $i$, which
entails requesting $0 \le i \le S$ price quotes.^ A search mechanism (which could
be manual or shopbot-assisted) instantly provides price quotes for $i$ randomly
chosen sellers. Buyer $b$ then selects a seller $s$ whose quoted price $p_s$ is lowest
among the $i$ (ties are broken randomly), and purchases the commodity from
seller $s$ if and only if $p_s \le v_b$, where $v_b$ is buyer $b$'s valuation of the commodity.

In addition to the purchase price, buyers may incur search costs. The cost $c_i$
of using search strategy $i$, however, does not enter into the purchasing decision
of the buyers, because buyers must commit to conducting a search before the 
results of that search become available. In other words, search payments are 

^ We permit a search strategy of 0 to allow buyers to opt out of the market entirely, 
which may be desirable if search costs are prohibitive. 
sunk costs. Instead, search costs affect the choice ($0 \le i \le S$) of search strategy
utilized by buyers. A buyer $b$ is assumed to periodically re-evaluate its strategy at
a rate $\alpha_b < \rho_b$, where typically, $\alpha_b \ll \rho_b$. Upon re-evaluation, the rational buyer
estimates a price $\hat{p}_i$ that it expects to pay for the commodity if it uses strategy
$i$, and selects the strategy $j$ that minimizes $\hat{p}_j + c_j$, provided that $\hat{p}_j + c_j \le v_b$.
If this condition is not satisfied, then $j = 0$: i.e., the rational buyer does not
search and does not participate in the market at that time.
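The re-evaluation rule just described can be sketched in a few lines of Python. This is a minimal illustration rather than the simulation code used for the experiments: the function name `choose_strategy` and the list-based encoding (index $j$ holds the estimate for strategy $j$, index 0 is unused) are assumptions made here for the example.

```python
def choose_strategy(expected_prices, search_costs, valuation):
    """Pick the search strategy j >= 1 minimizing expected price plus search cost.

    expected_prices[j] and search_costs[j] refer to strategy j; index 0 is ignored.
    Returns 0 (opt out of the market) when even the best total exceeds the valuation.
    """
    best_j, best_total = 0, None
    for j in range(1, len(expected_prices)):
        total = expected_prices[j] + search_costs[j]
        if best_total is None or total < best_total:
            best_j, best_total = j, total
    if best_total is None or best_total > valuation:
        return 0
    return best_j


# Hypothetical numbers: three strategies, valuation 1.0.
prices = [None, 0.90, 0.70, 0.65]
costs = [0.0, 0.00, 0.05, 0.15]
print(choose_strategy(prices, costs, valuation=1.0))
```

With these illustrative numbers the totals are 0.90, 0.75, and 0.80, so strategy 2 is selected; lowering the valuation below 0.75 makes the buyer opt out.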

The buyer population at a given moment is characterized by a strategy vector
$w$, in which the component $w_i$ represents the fraction of buyers employing strategy
$i$ and $\sum_{i=0}^{S} w_i = 1$. A seller $s$'s expected profit per unit time $\pi_s$ is a function
of the strategy vector $w$, the price vector $p$ describing all sellers' prices, and the
cost of production $r_s$ for seller $s$. In particular, $\pi_s(p, w) = D_s(p, w)(p_s - r_s)$,
where $D_s(p, w)$ is the rate of demand for the good produced by seller $s$, given
the current price and search strategy vectors. The demand $D_s(p, w)$ is the product
of (i) the overall buyer rate of demand $\rho = \sum_b \rho_b$, (ii) the likelihood that
seller $s$ is selected as a potential seller, denoted $h_s(p, w)$, and (iii) the fraction
of buyers whose valuations satisfy $v_b \ge p_s$, denoted $g(p_s)$. Specifically,
$D_s(p, w) = \rho B h_s(p, w) g(p_s)$. Without loss of generality, we define the time
scale such that $\rho B = 1$. Then we can interpret $\pi_s$ as seller $s$'s expected profit
per unit sold systemwide.

The likelihood of a given buyer selecting seller $s$ as their potential seller,
namely $h_s(p, w)$, depends on the search strategies of the buyers. In particular,
this term is the sum over all buyer types of the fraction of the buyer population
of type $i$ times the probability $h_{s,i}(p)$ that seller $s$ is selected by a buyer of type
$i$, namely $h_s(p, w) = \sum_{i=0}^{S} w_i h_{s,i}(p)$. The quantities $h_{s,i}(p)$ are investigated in
detail in the following section. Finally, the value $g(p_s) = \int_{p_s}^{\infty} \gamma(x)\,dx$, where $\gamma(x)$
is the probability density function describing the likelihood that a given buyer
has valuation $x$. For example, if all buyers have the same valuation $v$, i.e., $v_b = v$,
then $\gamma(x)$ is the Dirac delta function $\delta(v - x)$, and the integral yields a step
function $g(p_s) = \Theta(v - p_s)$, equal to 1 when $p_s \le v$ and 0 otherwise. Assuming
all buyers have equal valuations $v$,^ and all sellers share the same cost $r$, the
profit function can now be expressed as follows: $\pi_s(p, w) = h_s(p, w)(p_s - r)$, if
$p_s \le v$, but otherwise, $\pi_s(p, w) = 0$.

3 Analysis 

In this section, we present a game-theoretic analysis of the prescribed model, 
assuming sellers are rational (i.e., utility maximizers). Initially, we focus entirely 
on the strategic decision-making of rational sellers, by assuming the distribution 
of the buyer population is fixed and exogenously determined. Later, we extend 
our analysis to rational buyers, thereby permitting w to vary. 

A Nash equilibrium is a vector of prices at which sellers maximize their 
individual profits and from which they have no incentive to deviate [10]. There 

^ In this case, w can be interpreted as representing a mixed search strategy of a single 
buyer who creates all of the demand in the system. 
are no pure strategy Nash equilibria for this model [7]. There does, however, exist
a symmetric Nash equilibrium in mixed strategies, which we derive presently. Let
$f(p)$ denote the probability density function according to which sellers set their
equilibrium prices, and let $F(p)$ be the corresponding cumulative distribution
function. In the range for which it is defined, $F(p)$ has no mass points, since
otherwise a seller could decrease its price by an arbitrarily small amount and
experience a discontinuous increase in profits. Moreover, there are no gaps in
the distribution, since otherwise prices would not be optimal: a seller charging
a price at the low end of the gap could increase its price to fill the gap while
retaining its market share, thereby increasing its profits.

The cumulative distribution function $F(p)$ is computed in terms of the probability
$h_s(p, w)$ that buyers select seller $s$ as their potential seller. This quantity
is the sum of $w_i h_{s,i}(p)$ over $0 \le i \le S$. The first component $h_{s,0}(p) = 0$. Consider
the next component $h_{s,1}(p)$. Buyers of type 1 select sellers at random; thus, the
probability that seller $s$ is selected by such buyers is simply $h_{s,1}(p) = 1/S$. Now
consider buyers of type 2. In order for seller $s$ to be selected by a buyer of type 2,
$s$ must be included within the pair of sellers being sampled (which occurs with
probability $(S - 1)/\binom{S}{2} = 2/S$) and $s$ must be lower in price than the other
seller in the pair. Since, by the assumption of symmetry, the other seller's price is
drawn from the same distribution, this occurs with probability $1 - F(p)$. Therefore
$h_{s,2}(p) = (2/S)[1 - F(p)]$. In general, seller $s$ is selected by a buyer of type
$i$ with probability $\binom{S-1}{i-1}/\binom{S}{i} = i/S$, and seller $s$ is the lowest-priced among the
$i$ sellers selected with probability $[1 - F(p)]^{i-1}$, since these are $i - 1$ independent
events. Thus, $h_{s,i}(p) = (i/S)[1 - F(p)]^{i-1}$, and^ $h_s(p) = \frac{1}{S}\sum_{i=1}^{S} i\, w_i [1 - F(p)]^{i-1}$.
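As a quick sanity check on the counting argument above, the following Python sketch verifies the inclusion probability $\binom{S-1}{i-1}/\binom{S}{i} = i/S$ and evaluates the selection probability as a function of $F = F(p)$. The function names and the list encoding of the strategy vector (`w[i]` for $i = 0, \ldots, S$) are assumptions introduced for this illustration.

```python
from math import comb


def inclusion_probability(i, S):
    """Probability that a given seller is among i sellers sampled from S."""
    return comb(S - 1, i - 1) / comb(S, i)


def selection_probability(F, w, S):
    """h_s as a function of F = F(p): (1/S) * sum_i i * w_i * (1 - F)**(i - 1).

    w[i] is the fraction of buyers of type i, for i = 0..S; type 0 never buys.
    """
    return sum(i * w[i] * (1.0 - F) ** (i - 1) for i in range(1, S + 1)) / S


# Illustrative population with S = 5: types 1, 2 and 5 only.
S = 5
w = [0.0, 0.2, 0.4, 0.0, 0.0, 0.4]
for i in range(1, S + 1):
    assert abs(inclusion_probability(i, S) - i / S) < 1e-12
print(selection_probability(0.0, w, S), selection_probability(1.0, w, S))
```

At $F = 1$ (the highest price) only buyers of type 1 can still select the seller, so the value reduces to $w_1/S$, which is the boundary value used in the equal-profit argument that follows.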

The precise value of $F(p)$ is determined by noting that a Nash equilibrium in
mixed strategies requires that all pure strategies that are assigned positive probability
yield equal payoffs, since otherwise it would not be optimal to randomize.
In particular, the expected profits earned by seller $s$, namely $\pi_s(p) = h_s(p)(p - r)$,
are constant for all prices $p$. The value of this constant can be computed from
its value at the boundary $p = v$; note that $F(v) = 1$ because no rational seller
charges more than any buyer is willing to pay. This leads to the following relation:
$h_s(p)(p - r) = h_s(v)(v - r) = \frac{1}{S} w_1 (v - r)$. Now solving for $p$ in terms of $F$
yields:

$$p(F) = r + \frac{w_1 (v - r)}{\sum_{i=1}^{S} i\, w_i (1 - F)^{i-1}} \qquad (1)$$

Eq. 1 has several important implications. First of all, in a population in which
there are no buyers of type 1 (i.e., $w_1 = 0$) the sellers charge the production
cost $r$ and earn zero profits; this is the traditional Bertrand equilibrium. On the
other hand, if the population consists of just two buyer types, 1 and some $i > 1$,
then it is possible to invert $p(F)$ to obtain:

$$F(p) = 1 - \left[ \frac{w_1}{i\, w_i} \cdot \frac{v - p}{p - r} \right]^{\frac{1}{i-1}} \qquad (2)$$

^ In the final equation, hs(p) is expressed as a function of seller s’s scalar price p, since 
we average over all other components of the price vector. 
The case in which $i = S$ was studied previously by Varian [13]; in this model,
buyers either choose a single seller at random (type 1) or search all sellers and
choose the lowest-priced among all sellers (type $S$).

Since $F(p)$ is a cumulative probability distribution, it is only valid in the
domain for which its value is between 0 and 1. As noted previously, the
upper boundary is $p = v$; the lower boundary $p^*$ can be computed by setting
$F(p^*) = 0$ in Eq. 1, which yields:

$$p^* = r + \frac{w_1 (v - r)}{\sum_{i=1}^{S} i\, w_i} \qquad (3)$$

In general, Eq. 1 cannot be inverted to obtain an analytic expression for $F(p)$.
It is possible, however, to plot $F(p)$ without resorting to numerical root finding
techniques. We use Eq. 1 to evaluate $p$ at equally spaced intervals in $F \in [0, 1]$;
this produces unequally spaced values of $p$ ranging from $p^*$ to $v$.
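The plotting procedure just described can be sketched as follows. This is an illustrative Python fragment, not the authors' code: the function names, the list encoding of the strategy vector (`w[i]` for $i = 0, \ldots, S$), and the parameter values $v = 1$, $r = 0.5$ are assumptions chosen for the example. It requires $w_1 > 0$, since otherwise the denominator of Eq. 1 vanishes at $F = 1$.

```python
def price_at(F, w, v, r):
    """Eq. 1: the price p at which the equilibrium CDF takes the value F."""
    S = len(w) - 1
    denom = sum(i * w[i] * (1.0 - F) ** (i - 1) for i in range(1, S + 1))
    return r + w[1] * (v - r) / denom


def price_curve(w, v, r, n=101):
    """(F, p) pairs at n equally spaced values of F in [0, 1]."""
    return [(k / (n - 1), price_at(k / (n - 1), w, v, r)) for k in range(n)]


# Illustrative parameters: S = 5 sellers, buyer types 1, 2 and 5.
w = [0.0, 0.2, 0.4, 0.0, 0.0, 0.4]
curve = price_curve(w, v=1.0, r=0.5, n=11)
print(curve[0], curve[-1])  # endpoints correspond to p* (Eq. 3) and v
```

Plotting the pairs with $p$ on the horizontal axis and $F$ on the vertical axis gives the CDF directly, with no root finding.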

We now consider the probability density function $f(p)$. Differentiating both
sides of the equation $h_s(p)(p - r) = \frac{1}{S} w_1 (v - r)$, we obtain an expression for $f(p)$
in terms of $F(p)$ and $p$ that is conducive to numerical evaluation:

$$f(p) = \frac{w_1 (v - r)}{(p - r)^2 \sum_{i=2}^{S} i(i-1)\, w_i [1 - F(p)]^{i-2}} \qquad (4)$$

The values of $f(p)$ at the boundaries $p^*$ and $v$ are as follows:

$$f(p^*) = \frac{\left[\sum_{i=1}^{S} i\, w_i\right]^2}{w_1 (v - r) \sum_{i=2}^{S} i(i-1)\, w_i} \quad \text{and} \quad f(v) = \frac{w_1}{2 w_2 (v - r)} \qquad (5)$$
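Eqs. 4 and 5 can be cross-checked numerically. The Python sketch below evaluates the density as a function of $F$ (taking $p$ from Eq. 1) and compares it with a finite-difference estimate of $dF/dp$. The function names and the parameter values are illustrative assumptions; the sketch also assumes $w_1 > 0$ and $w_2 > 0$, so that the denominators stay nonzero at $F = 1$.

```python
def price_at(F, w, v, r):
    """Eq. 1: price corresponding to CDF value F."""
    S = len(w) - 1
    denom = sum(i * w[i] * (1.0 - F) ** (i - 1) for i in range(1, S + 1))
    return r + w[1] * (v - r) / denom


def density_at(F, w, v, r):
    """Eq. 4: f(p) evaluated at the price p = price_at(F)."""
    S = len(w) - 1
    p = price_at(F, w, v, r)
    denom = sum(i * (i - 1) * w[i] * (1.0 - F) ** (i - 2) for i in range(2, S + 1))
    return w[1] * (v - r) / ((p - r) ** 2 * denom)


# Illustrative parameters: S = 5, types 1, 2 and 5.
w, v, r = [0.0, 0.2, 0.4, 0.0, 0.0, 0.4], 1.0, 0.5

# f should equal dF/dp, i.e. f * (dp/dF) should be close to 1.
h = 1e-6
slope = (price_at(0.5 + h, w, v, r) - price_at(0.5 - h, w, v, r)) / (2 * h)
print(density_at(0.5, w, v, r) * slope)

# Boundary values, to be compared with Eq. 5.
print(density_at(0.0, w, v, r), density_at(1.0, w, v, r))
```

When $w_2 = 0$ the sum in the denominator vanishes at $F = 1$, which corresponds to the divergence of $f(v)$ visible in Eq. 5; the sketch does not handle that limiting case.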



Fig. 1(a) and 1(b) depict the PDFs in the prescribed model under varying
distributions of buyer strategies (in particular, $w_1 = 0.2$ and $w_2 + w_S = 0.8$)
when $S = 5$ and $S = 20$, respectively. In both figures, $f(p)$ is bimodal when
$w_2 = 0$, as is derived in Eq. 5. Most of the probability density is concentrated
either just above $p^*$, where sellers expect low margins but high volume, or just
below $v$, where they expect high margins but low volume. In addition, moving
from $S = 5$ to $S = 20$, the boundary $p^*$ decreases, and the area of the no-man's
land between these extremes diminishes. In contrast, when $w_2, w_S > 0$, a
peak appears in the distribution. If a seller does not charge the absolute lowest
price when $w_2 = 0$, then it fails to obtain sales from any buyers of type $S$.
In the presence of buyers of type 2, however, sellers can obtain increased sales
even when they are priced moderately. Thus, there is an incentive to price in
this manner, as is depicted by the peak in the distribution. The case in which
$w_S = 0$, i.e., $w_1 + w_2 = 1$, is explored in more detail in the next section.

Recall that the profit earned by each seller is $(1/S) w_1 (v - r)$, which is strictly
positive so long as $w_1 > 0$. It is as though only buyers of type 1 are contributing
to sellers' profits, although the actual distribution of contributions from buyers of
type 1 vs. buyers of type $i > 1$ is not as one-sided as it appears. In reality, buyers
of type 1 are charged less than $v$ on average, and buyers of type $i > 1$ are charged
more than $r$ on average, although total profits are equivalent to what they would
be if the sellers practiced perfect price discrimination. In effect, buyers of type 1
exert negative externalities on buyers of type $i > 1$, by creating surplus profits
for sellers.

Fig. 1. PDFs for $w_1 = 0.2$ and $w_2 + w_{20} = 0.8$. Panel (b): 20 Sellers.



3.1 Endogenous Buyer Decisions 

Heretofore in our analysis, we have assumed rational decision-making on the part
of the sellers, but an exogenous distribution of buyer types. It is also of interest
to consider buyers as rational decision-makers, with the cost $c_i$ of comparing
the prices of $i$ sellers defined explicitly, thereby giving rise to endogenous search
behavior. As mentioned previously, rational buyers estimate the commodity's
price $\hat{p}_i$ that would be obtained by searching among $i$ sellers, and select the
strategy $i^*$ that minimizes $\hat{p}_i + c_i$, provided that $\hat{p}_i + c_i \le v_b$; otherwise, the
buyer does not search and does not participate in the marketplace.

Before studying the decision-making processes of individual buyers, it is useful
to analyze the distributions of prices paid by buyers of various types and their
corresponding averages at equilibrium. Recall that a buyer who obtains $i$ price
quotes pays the lowest of the $i$ prices. (At equilibrium, the sellers' prices never
exceed $v$ since $F(v) = 1$, so a buyer is always willing to pay the lowest price.) The
cumulative distribution for the minimal values of $i$ independent samples taken
from the distribution $f(p)$ is given by $Y_i(p) = 1 - [1 - F(p)]^i$. Differentiation
with respect to $p$ yields the probability distribution: $y_i(p) = i f(p)[1 - F(p)]^{i-1}$.
The average price $\bar{p}_i \equiv \int_{p^*}^{v} p\, y_i(p)\,dp$ for the distribution $y_i(p)$ can be expressed as follows:

$$\bar{p}_i = p^* + \int_{p^*}^{v} [1 - F(p)]^i\,dp = p^* + \int_0^1 \frac{(1 - F)^i}{f}\,dF \qquad (6)$$

where the first equality is obtained via integration by parts, and the second
depends on the observation that $dp/dF = [dF/dp]^{-1} = 1/f$. Combining Eqs. 1,
4, and 6 would lead to an integrand expressed purely in terms of $F$. Integration
over the variable $F$ (as opposed to $p$) is advantageous because $F$ can be chosen
to be equispaced, as standard numerical integration techniques require.
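A minimal Python sketch of this numerical integration is given below. It uses the trapezoidal rule on an equispaced grid in $F$, with $1/f = dp/dF$ obtained from Eqs. 1 and 4. The function names, the grid size, and the parameter values are assumptions made for illustration, and the trapezoidal rule is simply one convenient choice of quadrature, not necessarily the one used for the figures.

```python
def price_at(F, w, v, r):
    """Eq. 1: price corresponding to CDF value F."""
    S = len(w) - 1
    denom = sum(i * w[i] * (1.0 - F) ** (i - 1) for i in range(1, S + 1))
    return r + w[1] * (v - r) / denom


def dp_dF(F, w, v, r):
    """1/f from Eq. 4, written so that no division by zero occurs at F = 1."""
    S = len(w) - 1
    p = price_at(F, w, v, r)
    curv = sum(i * (i - 1) * w[i] * (1.0 - F) ** (i - 2) for i in range(2, S + 1))
    return (p - r) ** 2 * curv / (w[1] * (v - r))


def average_price(i, w, v, r, n=2000):
    """Eq. 6: average price paid by a buyer who samples i sellers."""
    total = 0.0
    for k in range(n + 1):
        F = k / n
        g = (1.0 - F) ** i * dp_dF(F, w, v, r)
        total += g if 0 < k < n else 0.5 * g  # trapezoidal end-point weights
    return price_at(0.0, w, v, r) + total / n


# Illustrative parameters: S = 5, types 1, 2 and 5.
w, v, r = [0.0, 0.2, 0.4, 0.0, 0.0, 0.4], 1.0, 0.5
print([average_price(i, w, v, r) for i in (1, 2, 5)])
```

For $i = 1$ the result should agree with the plain average $\int_0^1 p(F)\,dF$, which gives an independent check of the integration-by-parts step.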

Fig. 2(a) depicts sample price distributions for buyers of various types: $y_1(p)$,
$y_2(p)$, and $y_{20}(p)$, when $S = 20$ and $(w_1, w_2, w_{20}) = (0.2, 0.4, 0.4)$. The dashed
lines represent the average prices $\bar{p}_i$ for $i \in \{1, 2, 20\}$ as computed by Eq. 6. The
blue line labeled Search-1, which depicts the distribution $y_1(p)$, is identical to
the green line labeled $w_2 = 0.4$ in Fig. 1(b), since $y_1(p) = f(p)$. In addition,
the distributions shift toward lower values of $p$ for those buyers who base their
buying decisions on information pertaining to more sellers.

Fig. 2(b) depicts the average buyer prices obtained by buyers of various
types, when $w_1$ is fixed at 0.2 and $w_2 + w_{20} = 0.8$. The various values of $i$ (i.e.,
buyer types) are listed to the right of the curves. Notice that as $w_{20}$ increases,
the average prices paid by those buyers who perform relatively few searches
increase rather dramatically for larger values of $w_{20}$. This is because $w_1$ is fixed,
which implies that the sellers' profit surplus is similarly fixed; thus, as more and
more buyers perform extensive searches, the average prices paid by those buyers
decrease, which causes the average prices paid by the less diligent searchers to
increase. The situation is slightly different for those buyers who perform larger
searches but do not search the entire space of sellers: e.g., $i = 10$ and $i = 15$.
These buyers initially reap the benefits of increasing the number of buyers of
type 20, but eventually their average prices increase as well. Given a fixed portion
of the population designated as buyers of type 1, Fig. 2(b) demonstrates that
searching $S$ sellers is a superior buyer strategy to searching $1 < i < S$ sellers.
Thus, there is value in performing price searches: shopbots offer added value in
markets in which there exist buyers who shop at random. This observation leads
us directly into a discussion of explicit buyer search costs.




Fig. 2. (a) Buyer price distributions (PDFs) for 20 sellers, with $w_1 = 0.2$, $w_2 = 0.4$, $w_{20} = 0.4$.
(b) Average buyer prices for various buyer types; 20 sellers, $w_1 = 0.2$, $w_2 + w_{20} = 0.8$.



Initially, we model buyer search costs following Burdett and Judd [2], who
assume costs are linear in the number of searches; in particular, $c_i = c_1 + \delta(i - 1)$,
where $c_1, \delta > 0$ are, respectively, fixed and variable costs of obtaining price
quotes. Moreover, we assume buyers are rational decision-makers who strive to
minimize overall expenditure, and who use $\bar{p}_i$ (as in Eq. 6) as an estimate of
$\hat{p}_i$. Thus, an optimal buyer strategy $i^*$ satisfies: $i^* \in \arg\min_{0 \le i \le S} \bar{p}_i + c_i$. At
equilibrium, $0 < w_1 \le 1$, since if $w_1 = 0$, then all buyers perform some degree of
search, in which case all sellers charge the competitive price $r$ (see Eqs. 2 and
3), from which it follows that it is in fact not rational for buyers to search at all,
leading to the contradiction that $w_1 = 1$. Now since the buyer cost function $\bar{p}_i + c_i$
is convex, it is minimized at either a single integer value $i^*$, or two consecutive
integer values $i^*$ and $i^* + 1$. Thus, at equilibrium, either $w_1 = 1$, in which case
all sellers charge the monopolistic price $v$, or $w_1 + w_2 = 1$ and the sellers' prices
are given by the distribution $f(p)$.^

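In the case where $w_1 + w_2 = 1$, we can obtain analytic expressions for the
average prices seen by buyers of types 1 and 2:

$$\bar{p}_1 = p^* + \frac{(-1 + w_2)\left(\frac{2 w_2}{1 + w_2} + \log\left(\frac{1 - w_2}{1 + w_2}\right)\right)}{2 w_2}\,(v - r) \qquad (7)$$

$$\bar{p}_2 = p^* + \frac{(1 - w_2)\left(2 w_2 + (1 - w_2^2)\log\left(\frac{1 - w_2}{1 + w_2}\right)\right)}{2 w_2^2 (1 + w_2)}\,(v - r) \qquad (8)$$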



Fig. 3(a) plots $\bar{p}_1$ (i.e., Search-1) and $\bar{p}_2$ (i.e., Search-2) as a function of $w_2$.
Not surprisingly, these curves are downward sloping, which reflects the fact that
price decreases on average as the degree of search increases.

Fig. 3(b) plots the marginal cost of obtaining only one price quote rather than
searching for two. More specifically, this figure displays $\bar{p}_1 - \bar{p}_2$ as a function
of $w_2$. Notice that there exists $\delta > 0$ such that $\bar{p}_1 = \bar{p}_2 + \delta$. In the diagram,
$\delta$ is arbitrarily set at 0.02. The points of intersection between the marginal
cost curve and $\delta = 0.02$ represent the points at which buyers are indifferent
between obtaining a single price quote and obtaining two price quotes at price
$\delta$, but purchasing the commodity at the lower price of the two. In other words,
there are two equilibria on the curve, indicated by the colored circles. Above
the dotted line, the marginal cost is greater than $\delta$; thus, it is advantageous to
search and there is momentum in the rightward direction. On the other hand,
below the dotted line, the marginal cost is less than $\delta$, and it is therefore more
desirable not to search; hence, there is momentum in the leftward direction.
Following the direction of the arrows, we observe that the open circle represents
an unstable equilibrium, while the filled-in circle that falls on the curve is a
stable equilibrium. In addition, there is a second stable equilibrium in the lower
left-hand corner of the graph (indicated by a second filled-in circle) where $w_1 = 1$
and the equilibrium price is the monopolistic price $v$. The unstable equilibrium
represents a boundary between two basins of attraction: initial values of $w_2$
greater than this will migrate towards the equilibrium near $w_2 = 1$, while those
less than this will migrate towards $w_1 = 1$.



¹ This depends on the assumption that c_1 is sufficiently small such that w_0 = 0.
Otherwise, the equilibria which arise are such that w_1 = 1 − w_0 or w_1 + w_2 = 1 − w_0.




216 Jeffrey O. Kephart and Amy R. Greenwald 







(a) Average prices for buyers. (b) Marginal cost vs. w_2. (Horizontal axis in both panels: w_2, from 0.0 to 1.0.)

Fig. 3. An economy of buyers of type 1 and 2.



4 Shopbot Experiments 

In order to explore the likely effect of shopbots on market behavior, we consider 
three distinctive characteristics of shopbots in turn, focusing on how they affect 
search costs and buyers’ strategies. 

First of all, a typical shopbot such as the one residing at www.acses.com 
permits users to choose the number of sellers among whom to search. Since the 
service is free to buyers at present, and since the search is very fast (acses 
searches prices at 25 book retailers within about 20 seconds), there is only a 
very mild disincentive to requesting a large number of price quotations. Thus, 
the effective search cost is only weakly dependent on the number of searches. 
One way to model weak dependence on the number of searches is via a nonlinear 
search cost schedule: c_j = c_1 + δ(j − 1)^α, where the exponent α is in the range
0 ≤ α ≤ 1. Note that α = 1 yields the linear search cost model, while α = 0
yields a search cost that is independent of the number of searches for j > 1.
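
The cost schedule can be written down directly. The following Python sketch is illustrative only: the function name and the default parameter values (taken from the experiments reported later in this section) are choices made here, not part of the model definition.

```python
def search_cost(j, c1=0.05, delta=0.02, alpha=1.0):
    """Cost c_j of obtaining j price quotes: c_j = c1 + delta * (j - 1) ** alpha."""
    if j < 1:
        raise ValueError("a buyer obtains at least one quote")
    if j == 1:
        # Handle j = 1 explicitly so that alpha = 0 does not rely on 0 ** 0.
        return c1
    return c1 + delta * (j - 1) ** alpha


linear = [search_cost(j, alpha=1.0) for j in range(1, 6)]   # grows by delta per extra quote
weak = [search_cost(j, alpha=0.25) for j in range(1, 6)]    # grows much more slowly
flat = [search_cost(j, alpha=0.0) for j in range(1, 6)]     # constant for j > 1
```

With α = 0.25 the fifth quote costs only slightly more than the second, which is the "weak dependence" the text describes.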

Second, today’s shopbots are used by only a small fraction of shoppers. This 
is due at least in part to the fact that many potential users are unaware of the 
existence of shopbots, and others do not know where to find them or how to 
use them. One way of modeling buyers who do not use shopbots is to assume 
that such uninformed or “irrational” users are buyers of type 1, for which they
incur fixed cost c_1. This establishes a lower limit on the fraction w_1, which we
denote ⌊w_1⌋. In particular, ⌊w_1⌋ represents the fraction of uninformed buyers
who guarantee the sellers a strictly positive profit surplus. In the following two
subsections, we explore these issues in greater detail.

4.1 Nonlinear search costs 

Suppose that buyers periodically (at random times) re-evaluate their search
strategies and choose the strategy j that minimizes p̂_j + c_j, where p̂_j is their
estimate of the average price they are likely to get by using search strategy j.
One possibility is that the buyer (or an agent acting on the buyer's behalf) could
use historical data on sellers' prices to estimate p_j. However, we shall assume
here that the buyers are perfectly knowledgeable about the sellers' marginal
production cost r and the current state of the strategy vector w, and thus they
can integrate Eq. 6 numerically to compute p̂_j = p_j. As the buyers modify their
strategies in this manner, we assume further that the sellers monitor w and
instantaneously re-compute the symmetric price distribution f(p) and choose
their prices according to this distribution.

We can approximate this evolutionary process by a discrete time process
in which, at each time step, a fraction ε of the buyer population is given the
opportunity to switch to the optimal strategy. Then the strategy vector evolves
according to: w_i(t + 1) = w_i(t) + ε(δ_ij − w_i(t)), where j is the strategy that
minimizes p_j + c_j and δ_ij represents the Kronecker delta function, equal to 1
when i = j and 0 otherwise.
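
One step of this update can be sketched in Python as follows. The function name is illustrative, and the list is indexed from 0, so index 2 stands for strategy type 3; choosing the best strategy (which requires the average prices p_j) is left outside the sketch.

```python
def update_strategy_vector(w, best, eps=0.005):
    """One step of w_i(t+1) = w_i(t) + eps * (delta_ij - w_i(t)).

    w    -- current strategy vector (fractions summing to 1)
    best -- 0-based index j of the strategy that currently minimizes p_j + c_j
    eps  -- fraction of buyers allowed to switch in this step
    """
    return [wi + eps * ((1.0 if i == best else 0.0) - wi) for i, wi in enumerate(w)]


w0 = [0.2, 0.3, 0.0, 0.0, 0.5]
w1 = update_strategy_vector(w0, best=2)  # type 3 is favored: it gains, all others shrink
```

Each component loses a fraction ε of its mass and the favored component gains ε, so the vector keeps summing to 1.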

Fig. 4(a) illustrates the evolution of the components of w in a 5-seller system
when w_1 is completely endogenous (⌊w_1⌋ = 0) and the search costs are linear
(α = 1, c_1 = 0.05, and δ = 0.02). The value of ε is 0.005. Recall that according
to Burdett and Judd [2], w must evolve toward an equilibrium consisting of a
finite number of type 1 and type 2 buyers. Indeed, this does occur, but what is
most interesting is the trajectory of w on its route toward equilibrium.





Fig. 4. (a) Evolution of indicated components of buyer strategy vector w for 5 sellers,
with linear search costs c_i = 0.05 + 0.02(i − 1). Final equilibrium oscillates with small
amplitude around theoretical solution involving a mixture of strategy types 1 and 2.
(b) Evolution of indicated components of buyer strategy vector w for 5 sellers, with
nonlinear search costs c_i = 0.05 + 0.02(i − 1)^0.25. Final equilibrium oscillates chaotically
around a mixture of strategy types 1, 2, and 3.



Initially, w_0 = (0.2, 0.3, 0.0, 0.0, 0.5). In this situation, the favored strategy
is type 3, and so w_3 begins to grow at the expense of w_1, w_2 and w_5. However,
as w_5 diminishes, the total amount of search in the system diminishes, and f(p)
flattens and shifts in such a way that eventually the favored strategy shifts from
3 to 2. Thereafter, w_2 grows at the expense of w_3 and the other components. In
this simulation, near but imperfect equilibrium is achieved: due to the finite size
of ε (equal to 0.005), there are small oscillations in w_2 around an average value
that is close to the theoretical value of 0.9641721. This value can be derived by
identifying the value of w_2 corresponding to δ = 0.02 in Fig. 3(b). In Fig. 3(b),
there is a second value of w_2 satisfying δ = 0.02, near w_2 = 0.1375564. However,
this is the unstable equilibrium, and as discussed in the previous section it marks
the boundary between two basins of attraction, one in which the final equilibrium
is (w_1, w_2) = (0.0358279, 0.9641721), and the other in which (w_1, w_2) = (1, 0).
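
As a numerical check on the two crossings quoted above, the following Python sketch evaluates the marginal cost p_1 − p_2 from Eqs. 7 and 8 and finds where it equals δ = 0.02 by bisection. The values v = 1 and r = 0.5 are assumptions (this section does not restate them), and the function names and bracketing intervals are illustrative choices.

```python
import math


def price_gap(w2, v=1.0, r=0.5):
    """Marginal cost p_1 - p_2 for w_1 + w_2 = 1 (p* cancels in the difference)."""
    log_ratio = math.log((1.0 - w2) / (1.0 + w2))
    # Excess of p_1 over p*, in units of (v - r), following Eq. 7.
    excess1 = (1 - w2) * (-2 * w2 - (1 + w2) * log_ratio) / (2 * w2 * (1 + w2))
    # Excess of p_2 over p*, in units of (v - r), following Eq. 8.
    excess2 = (1 - w2) * (2 * w2 + (1 - w2 ** 2) * log_ratio) / (2 * w2 ** 2 * (1 + w2))
    return (excess1 - excess2) * (v - r)


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f changes sign exactly once on the interval."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


delta = 0.02
unstable = bisect(lambda w: price_gap(w) - delta, 0.01, 0.3)   # lower crossing
stable = bisect(lambda w: price_gap(w) - delta, 0.8, 0.999)    # upper crossing
```

Under the assumed v − r = 0.5, the two roots should fall close to the values of w_2 quoted in the text for the unstable and stable equilibria.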

The derivation of an equilibrium in which only type 1 and type 2 strategies
could co-exist was founded on the assumption that search costs are linear in the
amount of search. In order to investigate the effect of nonlinear search costs that
grow only weakly with the amount of search, we run the same experiment, in
which all parameters are identical except for the exponent α, which is reduced
from 1.0 to 0.25. Fig. 4(b) depicts the result. Interestingly, in this case the
system evolves to an equilibrium in which types 1, 2 and 3 co-exist: w oscillates
around the value (0.0217, 0.5357, 0.4426, 0.0000, 0.0000) in a way that appears to
be chaotic, but it remains to conduct further tests of this phenomenon. While the
chaotic oscillations are an artifact of the finite size of ε, and would disappear in
the limit ε → 0, they hint that the system would undergo large-scale nonlinear
and possibly chaotic oscillations if the buyers were to revise their strategies
synchronously rather than asynchronously.



4.2 Lower limit on w_1

In order to explore the consequences of some proportion of users failing to
adopt low-cost search methods (perhaps due to ignorance about their existence
or about how to use them), we now impose a lower limit on w_1, denoted
⌊w_1⌋. Fig. 5(a) depicts the result of imposing ⌊w_1⌋ = 0.04, with linear
search costs c_i = 0.05 + 0.005(i − 1). Starting from an initial strategy vector
w_0 = (0.04, 0.20, 0.00, 0.00, 0.76), the system evolves to an equilibrium in which
only types 1 and 4 co-exist, with w_1 = 0.04 and w_4 = 0.96.





Fig. 5. (a) Evolution of indicated components of buyer strategy vector w for 5 sellers,
with linear search costs c_i = 0.05 + 0.02(i − 1) and ⌊w_1⌋ = 0.04. Starting from the
initial w indicated in the text, the strategy vector evolves towards an equilibrium in
which only types 1 and 4 are present. (b) Two-dimensional cross-section of basin of
attraction for (⌊w_1⌋, δ) = (0.04, 0.005).

In numerous experiments with linear search costs, we have observed that the
final equilibrium always consists of a mixture of types 1 and i, where i is not
necessarily 2, as it must be when w_1 is determined in an entirely endogenous
fashion. The strategy i depends on the values of ⌊w_1⌋ and δ. Table 1 illustrates
the dependence of the strategy i that mixes with strategy 1 upon ⌊w_1⌋ and the
incremental cost δ. Higher values of ⌊w_1⌋ lead to higher equilibrium strategies i
(more extensive search) while higher incremental costs δ lead to lower equilibrium
strategies i (less extensive search). For the table entries (⌊w_1⌋, δ) = (0.04, 0.005)
and (⌊w_1⌋, δ) = (0.20, 0.020), multiple equilibria are obtained. In these cases, the
initial setting of the strategy vector determines which equilibrium is obtained.

The effect of initial conditions on equilibrium selection in the case (⌊w_1⌋, δ) =
(0.04, 0.005) is illustrated in Fig. 5(b). Four equilibria are possible, all of the form
w_1 + w_i = 1, for i = 2, 3, 4, 5. The set of initial conditions leading to equilibrium
i (its “basin of attraction”) forms a contiguous, smoothly bounded region,
a two-dimensional cross-section of which is depicted in Fig. 5(b).



⌊w_1⌋    δ = 0.001    δ = 0.005    δ = 0.020
0.01     5            2            2
0.04     5            2-5          2
0.20     5            5            2-3

Table 1. Search strategy or strategies that co-exist with type 1 search strategy, as a
function of ⌊w_1⌋ and incremental cost δ.



5 Conclusions and Future Work 

Our desire to explore the economic impact of shopbots in obtaining price and
product information has led us to a model that is similar in spirit to those
that have been investigated by economists interested in understanding the
phenomenon of price dispersion. Our goals, however, are prescriptive, rather than
descriptive, leading us to consider somewhat different causes and effects than
are typical of price dispersion studies. Ultimately, we are interested in designing
economically motivated software agents, as well as an infrastructure that will
support their interactions; thus, we have emphasized the constructive
computation of price distributions and averages, rather than merely providing classical
proofs of existence and other properties of equilibria.

Arguing that nonlinear search cost schedules are likely to exist naturally, or
might even be adopted intentionally by shopbots, we studied their effect within
the context of our model; our findings reveal that nonlinear search costs can lead
to more complicated mixtures of buyer strategies and more extensive search than
occur with linear costs. Another practical assumption, namely the existence of
a positive number of uninformed buyers who do not use search mechanisms,
can lead to similar outcomes. Taking evolutionary dynamics of buyer strategies
into account, we found that the final equilibrium strategy vector depends on its
initial value, and the route toward equilibrium can be surprisingly complicated.

In closing, we briefly mention two promising areas for future work. First,
combining the evolutionary dynamics of buyers with more interesting and realistic
models for seller pricing behavior such as those described in [7,8] would be of
practical importance, and is certain to lead to interesting dynamics. Secondly,
since shopbots are beginning to provide additional information about product
attributes, it would also be of interest to analyze and simulate a model that
accounts for both horizontal [1] and vertical differentiation.

Abstract. In this paper, we present an algorithm for learning the most probable
structure of a Bayesian Network from a database of cases. Starting from two
previous algorithms, K2 of Cooper and Herskovits, and B of Buntine, we
developed a new algorithm that relaxes the assumption of total ordering on the
nodes needed by K2 and requires fewer computations than B. To improve our
algorithm, we added some heuristics and an interactive process with the user.



1 Introduction 

Bayesian Belief Networks (BBN) are widely used to represent problems dealing with
uncertainty. With its Directed Acyclic Graph structure, a BBN can capture the
dependence and independence relationships among variables in a given domain of
knowledge. A BBN can reduce the number of probabilities needed and the number of
computations because, for each node, we need only specify its probability
conditioned on the values of its parents. BBNs have been constructed with success in
different application areas (medical diagnosis [Beinlich, 89], [Berzuini, 91], oil price
reasoning [Abramson, 91], agriculture [Michalski, 88], ...) and have been used especially
for propagating new information and updating probabilities [Pearl, 88],
[Lauritzen, 88].

A major problem when using BBNs lies in the difficulty of constructing them. In
addition, this task is time-consuming, especially when we deal with a complex
domain. The problem also occurs when experts are rare or not available. In that case,
it is difficult to obtain all the dependence and independence relationships among the
variables. Therefore, it would be very useful if the construction of the
structure could be fully or partly automated. In fact, it would be useful to extract the
structure from databases, which are becoming more and more available in all domains
(science, engineering, business, ...). The networks generated from a database can be
used directly in a decision making process or used as a starting point ready for
modification by an expert. In both cases, it is surely less time-consuming than when it
is generated manually [Herskovits, 91], [Cooper, 91], [Heckerman, 94], [Bouckaert,
94], [Buntine, 94], [Singh, 95].

In addition, when using an interactive process, the expert is not completely bypassed.
He guides the operation by giving his opinion when he is sure about the presence or
absence of a link, and by correcting the structure if necessary. So, one part of the structure
is given by the expert and the other one is detected from the database.

The starting point of the new algorithm proposed in this paper is two well-known
algorithms: K2 of Cooper and Herskovits, and B of Buntine.
The K2 algorithm takes as input the set (V) of n nodes, a database (D) of m cases, an
ordering on the nodes and an upper limit on the number of parents that a node can
have (u). After exploring the database, K2 provides as output the parent set of each
node, composed of the most probable links between the node and all its predecessors.
K2 has been shown to produce very interesting and encouraging results, but the
main drawback of this method is the total ordering requirement on all the
nodes, which is very difficult to obtain correctly from an expert [Cooper, 91].

On the other hand, algorithm B does not require an a priori ordering. It just needs as
input the set V of the nodes and the database D of m cases. As in K2, in order to
provide the parent set of each node, algorithm B explores the database looking for the
most probable links, but this time not only with the predecessors (there is no
ordering), but with all the nodes. Its major drawback is the computation time,
because it requires comparing all the links. Therefore, as the number of nodes grows,
the computational complexity increases [Buntine, 91].

Our new algorithm, called K2B, does not require a total ordering on the nodes
because, as has already been said, it is very difficult for the expert to give such an
ordering, especially when the number of nodes is large. However, the expert can give
his opinion about the ordering of some subsets of variables that we name clusters.
Therefore, once the clusters are formed, we apply algorithm K2 within each one.
Several structures can be obtained (one per cluster). The next step consists of
applying algorithm B between variables in different clusters in order to be able to link
the different structures detected previously. Thus, algorithm B is not applied on
all the variables but just between the clusters. This enables us to save a great number of
computations. Our algorithm is quite different from K2, since we only introduce the
ordering on the variables that the expert is sure of. On the other hand, it is also
different from B, since the number of iterations is reduced by determining the
inter-cluster links only. We have established some heuristics in order to improve our
algorithm and to reduce the number of computations without diminishing the
efficiency of K2B.

In this paper, we propose this new algorithm that extracts a Bayesian Network
structure from a database of cases. In Section 2, we start by describing the two base
algorithms, K2 of Cooper and B of Buntine. In Section 3, we present the
new proposed solution in detail, with some heuristics that we added to improve the
efficiency of our algorithm. It is also devoted to the flowcharts and the structure of
the algorithm. Finally, Section 4 contains an example of the process.

2 Background



2.1 K2 Algorithm [Cooper, 91] 

Cooper and Herskovits proposed a method called Bayesian Learning of belief
Networks (BLN) in order to generate the most probable structure from a database of
cases and they proved the following theorem:

Consider a set V of n discrete variables. Each variable x_i ∈ V has r_i possible value
assignments: (v_i1, ..., v_ir_i). Let D be a database of m complete cases (each case contains
a value assignment for each variable in V). Let B_S denote a belief network structure
containing just the variables in V. Each variable x_i in B_S has a set of parents π_i. Let
w_ij denote the j-th instantiation of π_i in an arbitrary ordering of all distinct instantiations
of π_i relative to D, and suppose there are q_i such unique instantiations of π_i. Let N_ijk be
the number of cases in D where x_i is instantiated to v_ik while π_i is instantiated to w_ij,
and let N_ij = Σ_{k=1}^{r_i} N_ijk. Let u be the maximum number of parents that a node can have.

Cooper and Herskovits proved their theorem based on four assumptions, which are: 

1. the database variables are discrete; 

2. cases occur independently, given a belief network model; 

3. all variables are instantiated to a certain value for every case; 

4. before observing the database, we are indifferent regarding the numerical 
probabilities to place on the belief network structure. 



The statement of the theorem is [Cooper, 91]: P(B_S, D) = P(B_S) ∏_{i=1}^{n} g(x_i, π_i),
where g(x_i, π_i) is given by:

$$g(x_i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \tag{1}$$
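
To make the counting in formula (1) concrete, here is a minimal Python sketch that evaluates log g(x_i, π_i) directly from a list of cases. The data layout (each case a tuple of integer codes, variables indexed from 0) and the function name are assumptions made for this illustration; logarithms are used only to avoid overflow in the factorials.

```python
from collections import Counter
from math import lgamma


def log_g(data, i, parents, arity):
    """Natural log of g(x_i, pi_i) from formula (1).

    data    -- list of complete cases, each a tuple of integer value codes
    i       -- index of the variable x_i
    parents -- iterable of indices forming the parent set pi_i
    arity   -- arity[i] is r_i, the number of possible values of x_i
    """
    parents = sorted(parents)
    r_i = arity[i]
    n_ij = Counter()    # parent instantiation w_ij -> N_ij
    n_ijk = Counter()   # (w_ij, value of x_i) -> N_ijk
    for case in data:
        w = tuple(case[p] for p in parents)
        n_ij[w] += 1
        n_ijk[(w, case[i])] += 1
    total = 0.0
    for n in n_ij.values():
        # log[(r_i - 1)! / (N_ij + r_i - 1)!], using lgamma(m + 1) = log(m!)
        total += lgamma(r_i) - lgamma(n + r_i)
    for n in n_ijk.values():
        total += lgamma(n + 1)  # log(N_ijk!)
    return total


# One binary variable without parents, observed as 0, 0, 1:
# g = 1! / 4! * (2! * 1!) = 1/12.
example = log_g([(0,), (0,), (1,)], 0, [], [2])
```

Only parent instantiations that actually occur in D contribute, which matches the definition of the q_i unique instantiations relative to D.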



This result is used to compute the probability of a structure given a database.
Consequently, it allows us to find the most probable one after comparing them. But,
since the number of possible structures grows exponentially as the number of
variables increases, it becomes impossible to find the most probable structure given
the database by exhaustively enumerating all possible structures (for example, with 2
variables we have 3 structures, and with 3 variables, 25 structures).
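
The growth in the number of candidate structures can be illustrated with Robinson's recurrence for counting labeled directed acyclic graphs. This is a standard combinatorial result, not something taken from the paper, and the function name is illustrative.

```python
from math import comb


def count_dags(n):
    """Number of directed acyclic graphs on n labeled nodes (Robinson's recurrence)."""
    a = [1]  # a[0] = 1: the empty graph
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]


counts = [count_dags(n) for n in range(1, 8)]
```

The recurrence reproduces the 25 structures quoted for three variables, and the count grows so quickly with n that exhaustive enumeration soon becomes impractical.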

To avoid this problem, Cooper and Herskovits used two more assumptions: that there is
a total ordering on the variables, and that all structures are a priori equally likely.
They proposed a greedy search algorithm,
called K2, that maximizes the probability of the structure and the database P(B_S, D) by
finding the parent set of each node that maximizes the function g(x_i, π_i).




224 



Fedia Khalfallah and Khaled Mellouli 



Algorithm K2

For each node i, 1 ≤ i ≤ n, find π_i as follows:
    π_i ← ∅
    P_old ← g(x_i, π_i)
    NotDone ← True
    While NotDone and |π_i| < u do
        Let z be the node in Predecessors(x_i) − π_i that maximizes g(x_i, π_i ∪ {z})
        P_new ← g(x_i, π_i ∪ {z})
        If P_new > P_old
            then P_old ← P_new
                 π_i ← π_i ∪ {z}
            else NotDone ← False
    End {While}

In order to find the parent set of each node, K2 first assumes that all the nodes have no
parents. Then, for each node x, it selects among the predecessors of x in the
ordering the node z whose addition to the parent set of x increases the
probability of the resulting structure by the largest amount. It stops adding parents to
the node x when the number of parents reaches the limit u or when no additional single
parent can increase the probability of the structure.
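
The greedy loop described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the scoring function is passed in as a callable that plays the role of g, and the toy score used in the example (which simply rewards agreement with a known parent set) is a hypothetical stand-in for a score computed from a database.

```python
def k2(order, u, score):
    """Greedy K2 search.

    order -- nodes in the given total ordering (earlier nodes are predecessors)
    u     -- maximum number of parents per node
    score -- callable score(node, frozenset_of_parents), standing in for g
    """
    parents = {}
    for pos, x in enumerate(order):
        pi = set()
        p_old = score(x, frozenset(pi))
        predecessors = set(order[:pos])
        while len(pi) < u and predecessors - pi:
            # Candidate whose addition gives the highest score.
            z = max(sorted(predecessors - pi),
                    key=lambda c: score(x, frozenset(pi | {c})))
            p_new = score(x, frozenset(pi | {z}))
            if p_new > p_old:
                p_old = p_new
                pi.add(z)
            else:
                break  # no single additional parent improves the score
        parents[x] = pi
    return parents


# Hypothetical toy score: penalize every disagreement with a known structure.
true_parents = {0: set(), 1: {0}, 2: {1}}


def toy_score(x, pi):
    return -len(set(pi) ^ true_parents[x])


result = k2([0, 1, 2], u=2, score=toy_score)
```

Passing the score as a parameter keeps the search logic separate from the way g is computed from the data.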



2.2 B Algorithm [Buntine, 91] 

Like K2, algorithm B is a greedy search heuristic that exploits the same function g(x_i,
π_i) to evaluate the relationship between a node x_i and its parent set π_i. The main
difference with K2 is that B does not require a previous ordering on the variables. For
the pseudo-code of B, we used the same variables and structures as used in K2.

Let B_S be the network structure at a moment during the execution of the algorithm.
The objective is to find which link to add in order to increase the quality of the
structure, without introducing a cycle. Algorithm B calculates a function Δ which
represents the difference between structure B_S and structure B_S' obtained by adding a
link v_j → v_i to B_S. In other words, function Δ allows us to compare two relationships
and to give the most probable one. The relations are, on the one hand, the node x_i with
its parent set π_i and, on the other hand, the node x_i with its parent set π_i extended by
the node x_j (π_i ← π_i ∪ {x_j}):

$$\Delta[i, j] = g(x_i, \pi_i \cup \{x_j\}) - g(x_i, \pi_i) \tag{2}$$

If Δ[i, j] > 0 then x_j is considered as a parent of x_i and it is added to the parent set π_i.
The added link must not create a cycle; for links that would do so, function Δ is
saturated (that means Δ[i, j] = −∞). If an arc ending at node v_k is added, only the values
Δ[k, m], m = 1, ..., n, need to be recalculated. In the pseudo-code, A_i is the set of indices
of the ascendants of a node x_i and D_i is the set of indices of descendants of node x_i,
including i.

Algorithm B

For each i ∈ {1..n} do π_i ← ∅
For i = 1 to n, j = 1 to n do
    if i ≠ j then Δ[i, j] ← g(x_i, {x_j}) − g(x_i, ∅)
    else Δ[i, j] ← −∞ {elimination of x_i → x_i}
Repeat
    Choose i, j that maximize Δ[i, j]
    If Δ[i, j] > 0 then
        π_i ← π_i ∪ {x_j}
        for a ∈ A_i, b ∈ D_i do
            Δ[a, b] ← −∞ {elimination of loops}
        for k = 1 to n do
            if Δ[i, k] > −∞ then Δ[i, k] ← g(x_i, π_i ∪ {x_k}) − g(x_i, π_i)
Until Δ[i, j] ≤ 0 or Δ[i, j] = −∞ for all i and j

As in K2, in order to find the parent set of each node, algorithm B first assumes that all
the parent sets are empty. Then for each node x, instead of selecting a node z from the
predecessors, it chooses the node among all the other nodes (z ∈ V − {x}). It adds
incrementally to the parent set of x the node which increases the probability of the
resulting structure and does not produce loops.
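
A compact Python sketch of algorithm B follows. As before, the score callable stands in for g and nodes are indexed from 0; both are choices made for this illustration, and the helper names are hypothetical.

```python
import math


def ancestors(parents, i):
    """Indices of all ascendants of node i (i itself excluded)."""
    seen, stack = set(), list(parents[i])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen


def descendants(parents, i):
    """Indices of all descendants of node i, including i itself."""
    return {i} | {k for k in parents if i in ancestors(parents, k)}


def algorithm_b(n, score):
    """Greedy search without a node ordering; score(node, frozenset) stands in for g."""
    parents = {i: set() for i in range(n)}
    delta = [[-math.inf] * n for _ in range(n)]  # diagonal stays -inf (no x_i -> x_i)
    for i in range(n):
        for j in range(n):
            if i != j:
                delta[i][j] = score(i, frozenset({j})) - score(i, frozenset())
    while True:
        best, i, j = max((delta[a][b], a, b) for a in range(n) for b in range(n))
        if best <= 0:
            break  # every remaining link is non-improving or forbidden
        parents[i].add(j)
        for a in ancestors(parents, i):
            for b in descendants(parents, i):
                delta[a][b] = -math.inf  # x_b -> x_a would close a cycle
        for k in range(n):
            if delta[i][k] > -math.inf:
                base = frozenset(parents[i])
                delta[i][k] = score(i, base | {k}) - score(i, base)
    return parents


# Hypothetical toy score: penalize every disagreement with a known structure.
true_parents = {0: set(), 1: {0}, 2: {1}}
result = algorithm_b(3, lambda x, pi: -len(set(pi) ^ true_parents[x]))
```

Ties between equally good links are broken arbitrarily here (by index); the pseudo-code does not specify a tie-breaking rule.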



3 New proposed solution 

The starting point of this new algorithm is the two previous algorithms K2 and B
presented in section 2. It is an algorithm that does not require a total ordering on the
nodes. Nevertheless, we ask the expert to give his point of view concerning subsets of
nodes which we call clusters. We then apply algorithm K2 inside the
clusters to obtain several structures (one structure per cluster). The next step consists in
applying algorithm B between variables in the different clusters obtained previously. So
algorithm B is not applied on all the variables but just on the variables between
clusters in order to link them.

In order to improve our algorithm, we use a heuristic based on limiting the number of
parents that a node can possess when applying algorithm B. We have also opted not
to eliminate the expert completely, in order to remain more efficient. Hence, we ask
the expert for his point of view on links he is sure of and also on links which he
is sure do not exist. The expert's assertions help us to avoid many iterations. So
in the initialization phase, the parent sets are not empty, but we have an initial
structure. The next source of information is to ask the expert about a possible
ordering of the clusters. It means that when a cluster C_i is parent of cluster C_j, then
we eliminate all links going from cluster C_j to cluster C_i. This ordering can be given
by the expert whenever a link is detected between two clusters. We present
successively the description of the variables and structures used, the algorithm with
heuristics and the execution followed by an example.

3.1 Data description

For our algorithm, we consider the same variables and structures used in K2 and B. In
addition we use the following ones: C: set of nc clusters (C_1, ..., C_nc) such that C_i ∩
C_j = ∅, ∀ i ≠ j, and ∪_i C_i = V; C_t: the cluster number t, 1 ≤ t ≤ nc; and Pred_i:
predecessors set of x_i in the t-th cluster.

The algorithm considers the same function g(x_i, π_i) as K2 and B for evaluating a
relationship between a node x_i and its parent set π_i, and the same function Δ[i, j] to
compare two relationships and to give the most probable one.

We use a matrix M of n + 1 rows and n columns for representing all existing
binary links among n variables. The last row is used to store the number of
parents (U) that a node currently has. This matrix is initialized to 0, which
corresponds to an initial structure where no variables are linked yet: we still do not
have any information about the existence or non-existence of a link (except for
relations of the type x_i ← x_i, which are cycles that we can eliminate from the beginning).
M[i, j] corresponds to the relationship between node x_i and node x_j. We denote by
zero (0) an arc that is still not evaluated, by one (1) an arc that is detected, meaning that
x_j is a parent of x_i (π_i ← π_i ∪ {x_j}), and by minus one (−1) an arc that is eliminated
definitively and is excluded from the structure. Excluded arcs enable us to avoid useless
calculations since we know that these links should not exist.
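
The three states of an entry of M can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: indices start at 0, the helper names are hypothetical, and the parent counts that the paper keeps in the extra (n + 1)-th row are computed on demand here instead of being stored.

```python
UNKNOWN, LINK, EXCLUDED = 0, 1, -1


def init_matrix(n):
    """n x n state matrix: everything unevaluated except the self-links x_i <- x_i."""
    m = [[UNKNOWN] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = EXCLUDED
    return m


def parent_sets(m):
    """M[i][j] == 1 means x_j is a parent of x_i."""
    return {i: {j for j, s in enumerate(row) if s == LINK} for i, row in enumerate(m)}


def parent_counts(m):
    """Number of parents of each node (the role of the extra row of M)."""
    return [sum(1 for s in row if s == LINK) for row in m]


def untreated(m):
    """Pairs (i, j) whose arc x_j -> x_i still has to be evaluated."""
    return [(i, j) for i, row in enumerate(m) for j, s in enumerate(row) if s == UNKNOWN]


m = init_matrix(3)
m[1][0] = LINK       # the expert is sure that x_0 is a parent of x_1
m[2][0] = EXCLUDED   # the expert is sure that x_0 is not a parent of x_2
```

Only the pairs returned by `untreated` need a score evaluation, which is where the savings from the expert's statements come from.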

Later on, as we go through the algorithm, we present the process of
updating matrix M. Sometimes links are added and at other times they are
excluded, so the tests are done exclusively on the arcs which have not yet been treated.
The matrix M represents the state of the structure at any time during the algorithm process,
until the last iteration, which yields the best structure.



3.2 Initialization 

In this section, we present the initialization phase (Fig. 1), which consists first of asking
the expert to form the different clusters. He selects variables that he thinks are
related to each other. Once these clusters are formed, he can give his opinion
about an ordering of these variables.

The next step is to ask the expert about the links that he is sure exist and
about those he is sure do not exist. We thus have from the start an initial structure
that allows us to avoid useless computations. The algorithm completes the structure
with the links that are not known and not evident. For the matrix M, we add some links
(M[i, j] = 1) and we exclude others (M[i, j] = −1). So before starting the algorithm,
some parent sets are not empty.

A third step consists of asking the expert about a possible ordering between the
several clusters previously formed. For example, if the expert says that cluster 1 is
parent of cluster 2, it means that we will eliminate all links going from cluster 2 to
cluster 1, and so avoid calculating them. Of course this
ordering between clusters is given only when the expert is sure about his statements;
when that is not the case, this step is skipped in the initialization phase. But, after
applying the algorithm, when the first arc linking any two clusters appears, we ask the
expert the question again using an interactive process and apply whatever follows from his
answer. We also ask about the maximum number of parents which a node
can have during the algorithm process.




Fig. 1. Initialization phase of K2B 



3.3 Description of the algorithm K2B 

Once the initialization phase is complete, the algorithm proceeds as follows
(Fig. 2). The first step consists in applying the algorithm K2 on all the clusters
(nc times). We obtain several sub-structures that must be linked to each other.
In the case where we have only one cluster (nc = 1), and consequently
an ordering on all the variables, the algorithm ends since K2 has already produced the
desired structure and the links between all the variables.

In the opposite case, when there is more than one cluster (nc > 1), we continue the
algorithm process with the initialization phase of algorithm B. We compute function
Δ (previously defined in section 2.2) for all nodes between clusters (example:
node x_i ∈ C_1 with all nodes x_j ∈ C_2, and so on). Meanwhile, we exclude the links not
detected in the same cluster and consider only links not treated (M[i, j] = 0).

The last step consists in selecting the nodes x_i and x_j which maximize function Δ. If
Δ[i, j] is positive, we must update the structure. Then, we must exclude links which
can introduce cycles in the structure and recalculate new values of function Δ,
considering all changes. This task is repeated until all values of function Δ are
saturated or negative, which means that no more updating can improve the structure. Once
this state is achieved, we have the most probable structure and
can list, for each node, its parent set π_i. For the construction of the network we can
read the final matrix M: each time we have the value M[i, j] = 1, we create an arc
going from node x_j to node x_i.

Step 1: Initialization phase
Step 2: Apply K2
Step 3: Apply B
Step 4: Developing structure

Fig. 2. Algorithm process of K2B

4 Example 

In this section, we present an example illustrating the execution of our algorithm. Let
V be composed of 9 variables {x_1, ..., x_9}, D be the database of cases and M the matrix
of n+1 rows and n columns. The different evaluations provided by the database are
fictitious in the following example. The maximal number of parents that a node can
possess is u = 3. All new modifications are written in bold.

4.1 Initialization phase

* Initialization of matrix M

Let M be the matrix previously initialized at 0 (since no arcs have been treated yet)
except 1-cycles (links of type x_i → x_i), which can be excluded from the start (M[i, j] =
−1 for all i = j).

* Composing the clusters and ordering the variables 

The expert has proposed three clusters. He was able to give his opinion about the 
ordering of the variables in these three sets. But he does not have any information 
about the relationship of the variables between clusters. For example, he is sure that 
x^ is not a parent of x^, and, Xj and x, can be parents of x,. But he can not say anything 
about Xj and x, or x,. 

Since we have these orderings, we can exclude the following links in the matrix M ; 
M[l, 2] = -1 ; M[l, 3] = -1 ; M[l, 4] = -1 ; M[2, 3] = -1 ; M[2, 4] = -1 ; M[3, 4] = -1. 



* Detection of sure links 

According to the expert, we add to the parent sets (π_1, ..., π_9) the nodes that he is
sure of. If it is not the case, the parent sets stay empty. For this example, the expert is
sure of links going from node x_1 to x_2 and from x_5 to x_6. The list of parent sets is as
follows: π_1 = ∅, π_2 = {x_1}, π_3 = ∅, π_4 = ∅, π_5 = ∅, π_6 = {x_5}, π_7 = ∅, π_8 = ∅, π_9 = ∅. On
the other hand, he is sure of the non-existence of the link from variable x_1 to x_9.
The matrix is then updated as follows: M[2, 1] = 1; M[6, 5] = 1; M[9, 1] = −1.

* Ordering on the clusters 

The expert says that cluster C_3 is parent of cluster C_2. This information permits us to
exclude all links going from cluster C_2 to cluster C_3. We have no information about
their relationship with cluster C_1. M[5, 8] = −1; M[5, 9] = −1; M[6, 8] = −1; M[6, 9]
= −1; M[7, 8] = −1; M[7, 9] = −1.

4.2 Step 2: Apply K2

For each cluster, we apply algorithm K2 to determine the structures according to the
orderings of the variables given by the expert. The following results are obtained:
jt, =0, n, ={x,}, n, ={xj, Jt, ={xj,n :5 =0, Jt, ={xj, ji, ={xj, Ji, =0, n, ={xj. 
Table 2. Matrix after step 2



4.3 Step 3: Apply B

The next step consists in computing function A for all nodes between clusters. The
number of links calculated is 52 and is obtained by this formula:

$2 \times \sum_{i<j} |C_i| \times |C_j| = 2 \times (|C_1| \times |C_2| + |C_1| \times |C_3| + |C_2| \times |C_3|)$

$= 2 \times (4 \times 2 + 4 \times 3 + 2 \times 3) = 52$

If we had to apply algorithm B for all the nodes, we would have been obliged to 
compute 72 links. For this example, we have only 45 computations to do thanks to the 
heuristics applied before. We have to evaluate all links such that M[i, j] = 0 and 
then apply algorithm B. The values of A[i, j] are computed from the database using 
formula (2). The maximum value is selected in each iteration, and gives us the link to 
be added to the structure. Once selected, we must exclude all links able to introduce 
cycles in the structure. Then we must recalculate function A taking into account the 
modifications. 
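
The three counts quoted above (52, 72 and 45) can be checked with a few lines of Python. This is only a sketch: the cluster sizes are read off the formula, and the way 45 is obtained (52 minus the six links excluded by the cluster ordering and the single link excluded by the expert) is one reading that is consistent with the entries of M listed in Section 4.1, not a statement taken from the text.

```python
from itertools import combinations

# Cluster sizes as they appear in the formula: 4, 2 and 3 variables.
cluster_sizes = [4, 2, 3]
n = sum(cluster_sizes)

# Directed links between two different clusters: both directions for each pair.
between_clusters = 2 * sum(a * b for a, b in combinations(cluster_sizes, 2))

# Directed links between any two distinct nodes (no clustering at all).
all_links = n * (n - 1)

# Assumption: the links already ruled out are the six excluded by the cluster
# ordering (M[5,8], M[5,9], M[6,8], M[6,9], M[7,8], M[7,9]) plus M[9,1].
excluded_in_advance = 6 + 1
to_evaluate = between_clusters - excluded_in_advance

print(between_clusters, all_links, to_evaluate)  # 52 72 45
```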

Iteration 1: A[6, 4] is the maximum value. It means that $x_4$ is a parent of $x_6$
($\pi_6 \leftarrow \pi_6 \cup \{x_4\}$). To avoid introducing cycles, we must exclude all the links A[a, b] such that
a belongs to the set of ascendants of $x_6$ ($A_6$ = {5, 4, 2, 1}) and b belongs to the set of
descendants of $x_6$, together with node $x_6$ itself ($D_6$ = {6, 7}). These are links A[5,6], A[5,7], A[4,6],
A[4,7], A[2,6], A[2,7], A[1,6], A[1,7]. The first two links are already excluded since
the nodes in question belong to the same cluster ($C_3$). After these modifications, we
recalculate the following links: A[6,1], A[6,8], A[6,2], A[6,9], A[6,3], A[6,4]. Matrix
M is updated by: M[1,6] = -1 ; M[1,7] = -1 ; M[2,6] = -1 ; M[2,7] = -1 ; M[4,6] = -1 ;
M[4,7] = -1 ; M[6,4] = 1.
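
The cycle-avoidance rule of Iteration 1 can be sketched as follows. The parent sets below are hypothetical: they were chosen only so that the ascendant and descendant sets match the ones quoted in the text ({5, 4, 2, 1} and {6, 7}); the function names are ours and are not part of the original algorithm.

```python
def ancestors(node, parents):
    """All nodes from which `node` can be reached by following arcs."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen

def descendants(node, parents):
    """All nodes reachable from `node` by following arcs."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def links_to_exclude(child, new_parent, parents):
    """Pairs (a, b) read as A[a, b]: a is an ancestor of `child` once the arc
    new_parent -> child is added, b is `child` or one of its descendants."""
    updated = {k: set(v) for k, v in parents.items()}
    updated.setdefault(child, set()).add(new_parent)
    anc = ancestors(child, updated)
    desc = descendants(child, updated) | {child}
    return {(a, b) for a in anc for b in desc}

# Hypothetical parent sets after step 2 (keys are child indices).
parents = {1: set(), 2: {1}, 3: {1}, 4: {2}, 5: set(),
           6: {5}, 7: {6}, 8: set(), 9: {8}}

excluded = links_to_exclude(6, 4, parents)
print(sorted(excluded))
```

Under these assumed parent sets the function returns the eight links listed in Iteration 1, including the two that were already excluded for belonging to the same cluster.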

Iteration 2: The next maximum value is A[9, 7]. It means that $x_7$ is a parent of $x_9$
($\pi_9 \leftarrow \pi_9 \cup \{x_7\}$). The process is the same as before. Matrix M is updated by these values:
M[1, 9] = -1 ; M[2, 9] = -1 ; M[4, 9] = -1 ; M[9, 7] = 1.

Iteration 3: Optimal value A[7, 3]. It means that $x_3$ is a parent of $x_7$. Matrix M is
updated by: M[3,7] = -1 ; M[3,9] = -1 ; M[7,3] = 1. After detecting this link, the expert
confirmed that cluster $C_1$ is a parent of cluster $C_3$. Consequently, we can exclude all
links going from cluster $C_3$ to cluster $C_1$. Matrix M is updated by: M[1,5] = -1 ;
M[2,5] = -1 ; M[3,5] = -1 ; M[3,6] = -1 ; M[4,5] = -1.



Iteration 4: Optimal value A[9, 6]. It means that $x_6$ is a parent of $x_9$. Matrix M is
updated by: M[9, 6] = 1. After this modification, we notice that the number of
parents of node $x_9$ is $|\pi_9| = u = 3$. This is the maximum value, so we can eliminate
all links having $x_9$ as child node. In matrix M, we just put (-1) instead of all the (0) in
column 9 (M[9, 2] = -1 ; M[9, 3] = -1 ; M[9, 4] = -1 ; M[9, 5] = -1).

Table 3. Matrix after step 3

If all links not treated have a negative or saturated function A which is the case for 
the example, then the algorithm ends. We say that we have obtained the optimal 
structure and that no other changes can improve the structure. 

4.4 Step 4: Developing structure

From the last matrix, we obtain the following final structure: 




Fig. 3. The final structure 



5 Conclusion 



The problem of identifying the structure of a Bayesian network for a given problem
is a difficult and time-consuming task, especially when the domain is complex or
when experts do not exist. We can use databases to overcome the difficulty raised by
the absence of experts and by the complexity of the domains. Several methods have been
developed to find the best structure that represents a problem [Herskovits, 91],
[Cooper, 91], [Heckerman, 94], [Bouckaert, 94]. But this task is still complex because
we must determine the most expressive, coherent and probable structure in as little time
as possible. Such a method would be an efficient way of assisting (in some special cases
replacing) the manual expert method. In this paper, we proposed a method to detect
the most probable structure. Some heuristics are used to improve the implementation
of our algorithm.

Abstract. We consider, in this paper, the problem of knowledge base
merging with integrity constraints. We propose a logical characterization
of those operators and give a representation theorem in terms of
pre-orders on interpretations. We show the close connection between belief
revision and merging operators and we show that our proposal extends
the pure merging case (i.e. without integrity constraints) we studied in a
previous work. Finally we show that Liberatore and Schaerf's commutative
revision operators can be seen as a special case of merging.



1 Introduction 

An important issue in distributed knowledge systems is the ability to determine
a global consistent state (knowledge) of the system. Consider, for example, the
problem of combining several expert systems. Suppose that each expert
system codes the knowledge of a human expert. To build an expert system it
is reasonable to try to combine all these knowledge bases into a single knowledge
base that expresses the knowledge of the expert group. This process makes it possible to
discover new pieces of knowledge distributed among the sources. For example,
if an expert knows that a is true and another knows that $a \to b$ holds, then
the “synthesized” knowledge knows that b is true whereas none of the experts
knows it. This was called implicit knowledge in [8]. However, simply putting these
knowledge bases together is the wrong approach, since there could be contradictions
between some experts.

Some logical characterizations of merging have been proposed [18, 19, 13, 14,
16, 15, 11]. In this paper we extend these works by proposing a logical
characterization when the result of the merging has to obey a set of integrity
constraints.

We define two subclasses of merging operators, namely majority merging
and arbitration operators. The former strive to satisfy a maximum of protagonists,
the latter try to satisfy each protagonist to the best possible degree.
In other words majority operators try to minimize global dissatisfaction whereas
arbitration operators try to minimize individual dissatisfaction.

* The proofs have been omitted for space requirements but can be found in the
extended version of this work [12].

We also provide a representation theorem à la Katsuno Mendelzon [9] and
we show the close connections between belief revision and merging operators.

In section 2 we state some notations. In section 3 we propose a logical def- 
inition of merging operators with integrity constraints, we define majority and 
arbitration operators and give a model-theoretic representation of those oper- 
ators. In section 4 we define two families of merging operators illustrating the 
logical definition. In section 5 we show the connections with other related works, 
first we show the close connection between belief revision and merging opera- 
tors, then we show that this work extends the one of [11]. Finally we show that 
Liberatore and Schaerf commutative revision operators can be seen as a special 
case of merging operators. In section 6 we give some conclusions and discuss 
future work. 



2 Preliminaries 

We consider a propositional language $\mathcal{L}$ over a finite alphabet $\mathcal{P}$ of propositional
atoms. An interpretation is a function from $\mathcal{P}$ to {0, 1}. The set of all the
interpretations is denoted $\mathcal{W}$. An interpretation $I$ is a model of a formula if and
only if it makes it true in the usual classical truth functional way. Let $\varphi$ be a
formula; $mod(\varphi)$ denotes the set of models of $\varphi$, i.e. $mod(\varphi) = \{I \in \mathcal{W} \mid I \models \varphi\}$.
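
As a small illustration of these notions, the following Python sketch enumerates the interpretations over a two-letter alphabet and computes sets of models by brute force. The alphabet, the two formulas and the encoding of formulas as Python functions are all assumptions made for the example.

```python
from itertools import product

ATOMS = ("a", "b")  # hypothetical two-letter alphabet

def interpretations(atoms=ATOMS):
    """All functions from the atoms to {0, 1}, as tuples in atom order."""
    return list(product((0, 1), repeat=len(atoms)))

def mod(formula, atoms=ATOMS):
    """Set of interpretations that make `formula` true."""
    return {w for w in interpretations(atoms) if formula(dict(zip(atoms, w)))}

phi1 = lambda v: v["a"] == 1                   # the formula a
phi2 = lambda v: (not v["a"]) or v["b"] == 1   # the formula a -> b

# Models of the conjunction are the intersection of the sets of models.
both = mod(phi1) & mod(phi2)
print(both)  # {(1, 1)}
```

Every model of the conjunction satisfies b, although neither formula alone entails it.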

A knowledge base $K$ is a finite set of propositional formulae which can be
seen as the formula $\varphi$ which is the conjunction of the formulae of $K$.

Let $\varphi_1, \ldots, \varphi_n$ be $n$ knowledge bases (not necessarily different). We call
knowledge set the multi-set $\Psi$ consisting of those $n$ knowledge bases: $\Psi =
\{\varphi_1, \ldots, \varphi_n\}$. We note $\bigwedge\Psi$ the conjunction of the knowledge bases of $\Psi$, i.e.
$\bigwedge\Psi = \varphi_1 \wedge \cdots \wedge \varphi_n$. The union of multi-sets will be noted $\sqcup$. Knowledge bases
will be denoted by lower case Greek letters and knowledge sets by upper case
Greek letters.

Since an inconsistent knowledge base gives no information for the merging 
process, we will suppose in the rest of the paper that the knowledge bases are 
consistent. 

Definition 1. A knowledge set $\Psi$ is consistent if and only if $\bigwedge\Psi$ is consistent.
We will use $mod(\Psi)$ to denote $mod(\bigwedge\Psi)$ and write $I \models \Psi$ for $I \in mod(\Psi)$.

Definition 2. Let $\Psi_1, \Psi_2$ be two knowledge sets. $\Psi_1$ and $\Psi_2$ are equivalent, noted
$\Psi_1 \leftrightarrow \Psi_2$, iff there exists a bijection $f$ from $\Psi_1 = \{\varphi^1_1, \ldots, \varphi^1_n\}$ to
$\Psi_2 = \{\varphi^2_1, \ldots, \varphi^2_n\}$ such that $\vdash f(\varphi) \leftrightarrow \varphi$.

A pre-order $\leq$ over $\mathcal{W}$ is a reflexive and transitive relation on $\mathcal{W}$. A pre-order
is total if $\forall I, J \in \mathcal{W}$, $I \leq J$ or $J \leq I$. Let $\leq$ be a pre-order over $\mathcal{W}$; we define $<$
as follows: $I < J$ iff $I \leq J$ and $J \not\leq I$, and $\simeq$ as $I \simeq J$ iff $I \leq J$ and $J \leq I$. We
write $I \in \min(mod(\varphi), \leq)$ iff $I \models \varphi$ and $\forall J \in mod(\varphi)$, $I \leq J$.

By abuse, if $\varphi$ is a knowledge base, $\varphi$ will also denote the knowledge set
$\Psi = \{\varphi\}$. For a positive integer $n$ we will denote by $\varphi^n$ the multi-set
$\{\varphi, \ldots, \varphi\}$ containing $n$ copies of $\varphi$.

3 Merging with Integrity Constraints

We first state a logical definition for merging with integrity constraints operators
(IC merging operators from now on), that is we give a set of properties an operator
has to satisfy in order to have a rational behaviour concerning the merging.

In this work, the result of the merging has to obey a set of integrity constraints
where each integrity constraint is a formula. We suppose that these constraints
do not contradict each other and we bring them together in a knowledge base
$\mu$. This knowledge base represents the constraints for the result of merging the
knowledge set $\Psi$, not for the knowledge bases in $\Psi$. Thus a knowledge base $\varphi$
in $\Psi$ does not necessarily obey the constraints. We consider that integrity
constraints have to be true in the knowledge base resulting from merging $\Psi$, that is
the result is not only consistent with the constraints (as is often the case in
databases), but it has to imply the constraints.

We will consider operators $\Delta$ mapping a knowledge set $\Psi$ and a knowledge
base $\mu$ to a knowledge base $\Delta_\mu(\Psi)$ that represents the merging of the knowledge
set $\Psi$ according to the integrity constraints $\mu$.

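Definition 3. $\Delta$ is an IC merging operator if and only if it satisfies the
following postulates:

(IC0) $\Delta_\mu(\Psi) \vdash \mu$
(IC1) If $\mu$ is consistent, then $\Delta_\mu(\Psi)$ is consistent
(IC2) If $\bigwedge\Psi$ is consistent with $\mu$, then $\Delta_\mu(\Psi) = \bigwedge\Psi \wedge \mu$
(IC3) If $\Psi_1 \leftrightarrow \Psi_2$ and $\mu_1 \leftrightarrow \mu_2$, then $\Delta_{\mu_1}(\Psi_1) \leftrightarrow \Delta_{\mu_2}(\Psi_2)$
(IC4) If $\varphi \vdash \mu$ and $\varphi' \vdash \mu$, then $\Delta_\mu(\varphi \sqcup \varphi') \wedge \varphi \nvdash \bot \Rightarrow \Delta_\mu(\varphi \sqcup \varphi') \wedge \varphi' \nvdash \bot$
(IC5) $\Delta_\mu(\Psi_1) \wedge \Delta_\mu(\Psi_2) \vdash \Delta_\mu(\Psi_1 \sqcup \Psi_2)$
(IC6) If $\Delta_\mu(\Psi_1) \wedge \Delta_\mu(\Psi_2)$ is consistent, then $\Delta_\mu(\Psi_1 \sqcup \Psi_2) \vdash \Delta_\mu(\Psi_1) \wedge \Delta_\mu(\Psi_2)$
(IC7) $\Delta_{\mu_1}(\Psi) \wedge \mu_2 \vdash \Delta_{\mu_1 \wedge \mu_2}(\Psi)$
(IC8) If $\Delta_{\mu_1}(\Psi) \wedge \mu_2$ is consistent, then $\Delta_{\mu_1 \wedge \mu_2}(\Psi) \vdash \Delta_{\mu_1}(\Psi)$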

The meaning of the postulates is the following: (IC0) assures that the result of
the merging satisfies the integrity constraints. (IC1) states that if the integrity
constraints are consistent, then the result of the merging will be consistent.
(IC2) states that if possible, the result of the merging is simply the conjunction
of the knowledge bases with the integrity constraints. (IC3) is the principle of
irrelevance of syntax, i.e. if two knowledge sets are equivalent and two integrity
constraints bases are logically equivalent then the knowledge bases resulting from the
two mergings will be logically equivalent. (IC4) is the fairness postulate: the point
is that when we merge two knowledge bases, merging operators must not give
preference to one of them. (IC5) expresses the following idea: if a group $\Psi_1$
compromises on a set of alternatives which $I$ belongs to, and another group $\Psi_2$
compromises on another set of alternatives which contains $I$, then $I$ has to be in
the chosen alternatives if we join the two groups. (IC5) and (IC6) together state
that if you could find two subgroups which agree on at least one alternative,
then the result of the global arbitration will be exactly those alternatives the
two groups agree on. (IC5) and (IC6) have been proposed by Revesz [19] for
weighted model-fitting operators. (IC7) and (IC8) are a direct generalization of
the (R5-R6) postulates for revision. They state that the notion of closeness is
well-behaved (see [9] for a full justification).

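Now we define two subclasses of merging operators, namely majority merging
operators and arbitration operators.

A majority merging operator is an IC merging operator that satisfies the
following majority postulate:

(Maj) $\exists n \;\; \Delta_\mu(\Psi_1 \sqcup \Psi_2^n) \vdash \Delta_\mu(\Psi_2)$

This postulate expresses the fact that if an opinion has a large audience, it
will be the opinion of the group.

An arbitration operator is an IC merging operator that satisfies the following
postulate:

(Arb) If $\Delta_{\mu_1}(\varphi_1) \leftrightarrow \Delta_{\mu_2}(\varphi_2)$,
      $\Delta_{\mu_1 \leftrightarrow \neg\mu_2}(\varphi_1 \sqcup \varphi_2) \leftrightarrow (\mu_1 \leftrightarrow \neg\mu_2)$,
      $\mu_1 \nvdash \mu_2$ and $\mu_2 \nvdash \mu_1$,
      then $\Delta_{\mu_1 \vee \mu_2}(\varphi_1 \sqcup \varphi_2) \leftrightarrow \Delta_{\mu_1}(\varphi_1)$

This postulate ensures that it is the median possible choices that are preferred.

Now that we have a logical definition of IC merging operators, we will state
a representation theorem that gives a more intuitive way to define IC merging
operators. More precisely we show that to each IC merging operator corresponds
a family of pre-orders on possible worlds. Let us first define the following:

Definition 4. A syncretic assignment is a function mapping each knowledge set
$\Psi$ to a total pre-order $\leq_\Psi$ over interpretations such that for any knowledge sets
$\Psi, \Psi_1, \Psi_2$ and for any knowledge bases $\varphi_1, \varphi_2$:

1. If $I \models \Psi$ and $J \models \Psi$, then $I \simeq_\Psi J$
2. If $I \models \Psi$ and $J \not\models \Psi$, then $I <_\Psi J$
3. If $\Psi_1 \equiv \Psi_2$, then $\leq_{\Psi_1} = \leq_{\Psi_2}$
4. $\forall I \models \varphi_1 \;\; \exists J \models \varphi_2 \;\; J \leq_{\varphi_1 \sqcup \varphi_2} I$
5. If $I \leq_{\Psi_1} J$ and $I \leq_{\Psi_2} J$, then $I \leq_{\Psi_1 \sqcup \Psi_2} J$
6. If $I <_{\Psi_1} J$ and $I \leq_{\Psi_2} J$, then $I <_{\Psi_1 \sqcup \Psi_2} J$

A majority syncretic assignment is a syncretic assignment which satisfies the
following:

7. If $I <_{\Psi_2} J$, then $\exists n \;\; I <_{\Psi_1 \sqcup \Psi_2^n} J$

A fair syncretic assignment is a syncretic assignment which satisfies the
following:

8. If $I <_{\varphi_1} J$, $I <_{\varphi_2} J'$ and $J \simeq_{\varphi_1 \sqcup \varphi_2} J'$, then $I <_{\varphi_1 \sqcup \varphi_2} J$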
The following theorem states that these conditions on the assignment
correspond to the properties of the merging operator:

Theorem 1. An operator is an IC merging operator (respectively IC majority
merging operator or IC arbitration operator) if and only if there exists a
syncretic assignment (respectively majority syncretic assignment or fair syncretic
assignment) that maps each knowledge set $\Psi$ to a total pre-order $\leq_\Psi$ such that

$mod(\Delta_\mu(\Psi)) = \min(mod(\mu), \leq_\Psi)$.

As pointed out by D. Makinson (personal communication), this definition of
merging operators from such assignments can be compared to the framework of
Social Choice Theory [2, 10, 17]. The aim of Social Choice Theory is to aggregate
individual choices into a social choice, i.e. to find, for a given set of agents
(corresponding to our knowledge sets) with individual preference relations, a social
preference relation which reflects the preferences of the set of agents. It turns
out that conditions 5 and 6 of the syncretic assignment are known in this
framework as the Pareto conditions and are widely seen as desirable. This brings
additional support to postulates (IC5) and (IC6), which correspond respectively
to conditions 5 and 6.

We will show in the next section that the set of postulates (IC0-IC8) is
consistent by giving two families of operators satisfying these postulates. That
is, we do not demand too much of merging operators. On the other hand these
postulates are sufficiently strong to rule out basic merging methods. For example
we can define a merging operator à la full meet revision, that is:

$\Delta_\mu(\Psi) = \begin{cases} \bigwedge\Psi \wedge \mu & \text{if consistent} \\ \mu & \text{otherwise} \end{cases}$

But this operator is not an IC merging operator since it does not satisfy (IC6).

Another basic merging method generally accepted is the conjunction of the
knowledge bases if consistent and their disjunction otherwise. The generalization
of this operator in the presence of integrity constraints is the following, where
$\bigvee\Psi$ denotes the disjunction of the knowledge bases of $\Psi$:

$\Delta_\mu(\Psi) = \begin{cases} \bigwedge\Psi \wedge \mu & \text{if consistent, else} \\ \bigvee\Psi \wedge \mu & \text{if consistent} \\ \mu & \text{otherwise} \end{cases}$

This operator is not an IC merging operator since it does not satisfy (IC6).

In [5] Benferhat et al. proposed merging operators in the possibility theory
framework and gave their syntactic counterpart. Their operators merge two
possibility distributions into a new one. Therefore the nature of the information they
merge is very different from knowledge sets. Nevertheless one can identify their
set of possibility distributions with a knowledge set in a natural way. In this case,
their operators satisfy neither (IC4) nor (IC6). However, with some strong
constraints on the possibility distributions their luk operator is a majority merging
operator.

4 Examples of Operators

We define in this section two families of operators. The first one, the $\Sigma$ family,
is a family of majority merging operators. The second one, the GMax family,
gives arbitration operators.

We will suppose here that we have at our disposal a distance between interpretations
(possible worlds), that is a function $d : \mathcal{W} \times \mathcal{W} \to \mathbb{N}$ such that $d(I, J) = d(J, I)$
and $d(I, J) = 0$ iff $I = J$.

From now on we define the distance between an interpretation $I$ and a knowledge
base $\varphi$ in the following way: $d(I, \varphi) = \min_{J \models \varphi} d(I, J)$

Definition 5. Let $\Psi$ be a knowledge set and let $I$ be an interpretation. We
define the distance between an interpretation and a knowledge set as
$d_\Sigma(I, \Psi) = \sum_{\varphi \in \Psi} d(I, \varphi)$. Then we have the following pre-order:
$I \leq^\Sigma_\Psi J$ iff $d_\Sigma(I, \Psi) \leq d_\Sigma(J, \Psi)$.
And the operator $\Delta^\Sigma$ is defined by: $mod(\Delta^\Sigma_\mu(\Psi)) = \min(mod(\mu), \leq^\Sigma_\Psi)$.

Theorem 2. $\Delta^\Sigma$ is an IC majority merging operator.
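
A minimal Python sketch of Definition 5 follows. It makes two assumptions that are not part of the definition: knowledge bases and constraints are given directly by their (non-empty) sets of models, and the distance is the number of differing atoms. The function names and the toy knowledge set are ours.

```python
from itertools import product

def hamming(i, j):
    """Number of atoms on which two interpretations differ."""
    return sum(x != y for x, y in zip(i, j))

def dist_to_base(i, base_models, d=hamming):
    """d(I, phi): distance from I to the closest model of the base."""
    return min(d(i, j) for j in base_models)

def sigma_merge(knowledge_set, mu_models, d=hamming):
    """Models of mu that minimise the sum of the distances to the bases."""
    def score(i):
        return sum(dist_to_base(i, base, d) for base in knowledge_set)
    best = min(score(i) for i in mu_models)
    return {i for i in mu_models if score(i) == best}

W = list(product((0, 1), repeat=2))   # all interpretations over two atoms
psi = [{(1, 1)}, {(1, 1)}, {(0, 0)}]  # two bases say (1,1), one says (0,0)
result = sigma_merge(psi, W)          # no constraint: mod(mu) is all of W
print(result)  # {(1, 1)}
```

In this toy knowledge set the opinion held by two of the three bases is the one that is selected.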

Definition 6. Let $\Psi$ be a knowledge set. Suppose $\Psi = \{\varphi_1 \ldots \varphi_n\}$. For each
interpretation $I$ we build the list $(d^I_1 \ldots d^I_n)$ of distances between this interpretation
and the $n$ knowledge bases in $\Psi$, i.e. $d^I_j = d(I, \varphi_j)$. Let $L^\Psi_I$ be the list obtained
from $(d^I_1 \ldots d^I_n)$ by sorting it in descending order. Let $\leq_{lex}$ be the lexicographical
order between sequences of integers (of the same length); we define the following
pre-order: $I \leq^{GMax}_\Psi J$ iff $L^\Psi_I \leq_{lex} L^\Psi_J$. And the operator $\Delta^{GMax}$ is defined
by: $mod(\Delta^{GMax}_\mu(\Psi)) = \min(mod(\mu), \leq^{GMax}_\Psi)$.

Theorem 3. $\Delta^{GMax}$ is an IC arbitration operator.
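
Definition 6 can be sketched in the same way. As before, bases are assumed to be given by their sets of models and the distance is the number of differing atoms; the names are ours. Python compares lists lexicographically, which is exactly the order needed here.

```python
from itertools import product

def hamming(i, j):
    """Number of atoms on which two interpretations differ."""
    return sum(x != y for x, y in zip(i, j))

def dist_to_base(i, base_models, d=hamming):
    """d(I, phi): distance from I to the closest model of the base."""
    return min(d(i, j) for j in base_models)

def gmax_merge(knowledge_set, mu_models, d=hamming):
    """Models of mu whose descending list of distances is lexicographically least."""
    def key(i):
        return sorted((dist_to_base(i, b, d) for b in knowledge_set), reverse=True)
    best = min(key(i) for i in mu_models)
    return {i for i in mu_models if key(i) == best}

W = list(product((0, 1), repeat=2))   # all interpretations over two atoms
psi = [{(1, 1)}, {(1, 1)}, {(0, 0)}]  # two bases say (1,1), one says (0,0)
result = gmax_merge(psi, W)
print(sorted(result))  # [(0, 1), (1, 0)]
```

On this toy knowledge set the operator selects the two interpretations that are at distance 1 from every base, rather than the interpretation favoured by two of the three bases.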

We now give a “concrete” merging example and illustrate the behaviour of
the two families of operators defined above. We will choose as distance for the
operators the Dalal distance [6]. The Dalal distance between two interpretations
is the number of propositional letters on which the two interpretations differ.

Example: At a meeting of the co-owners of a block of flats, the chairman
proposes for the coming year the construction of a swimming-pool, a tennis-court
and a private-car-park. But if two of these three items are built, the rent will
increase significantly. We will denote by $S$, $T$, $P$ respectively the construction of
the swimming-pool, the tennis-court and the private-car-park. We will denote by $I$
the rent increase.

The chairman points out that building two items or more will have an important
impact on the rent: $\mu = ((S \wedge T) \vee (S \wedge P) \vee (T \wedge P)) \to I$

There are four co-owners: $\Psi = \{\varphi_1 \sqcup \varphi_2 \sqcup \varphi_3 \sqcup \varphi_4\}$. Two of the co-owners want to
build the three items and don't care about the rent increase: $\varphi_1 = \varphi_2 = S \wedge T \wedge P$.
The third one thinks that building any item will cause at some time an increase of
the rent and wants to pay the lowest rent, so he is opposed to any construction:
$\varphi_3 = \neg S \wedge \neg T \wedge \neg P \wedge \neg I$. The last one thinks that the block of flats really needs a
tennis-court and a private-car-park but doesn't want a high rent increase: $\varphi_4 = T \wedge P \wedge \neg I$.

The propositional letters $S$, $T$, $P$, $I$ will be considered in that order for the
valuations:

$mod(\mu) = \mathcal{W} \setminus \{(0,1,1,0), (1,0,1,0), (1,1,0,0), (1,1,1,0)\}$
$mod(\varphi_1) = \{(1,1,1,1), (1,1,1,0)\}$
$mod(\varphi_2) = \{(1,1,1,1), (1,1,1,0)\}$
$mod(\varphi_3) = \{(0,0,0,0)\}$
$mod(\varphi_4) = \{(1,1,1,0), (0,1,1,0)\}$

We sum up the calculations in table 1. The rows marked with * (shaded in the
original table) correspond to the interpretations rejected by the integrity
constraints. Thus the result has to be found among the interpretations that are
not marked.

Table 1. Distances

  (S,T,P,I)    φ1  φ2  φ3  φ4   d_Σ   sorted list (GMax)
  (0,0,0,0)     3   3   0   2     8   (3,3,2,0)
  (0,0,0,1)     3   3   1   3    10   (3,3,3,1)
  (0,0,1,0)     2   2   1   1     6   (2,2,1,1)
  (0,0,1,1)     2   2   2   2     8   (2,2,2,2)
  (0,1,0,0)     2   2   1   1     6   (2,2,1,1)
  (0,1,0,1)     2   2   2   2     8   (2,2,2,2)
* (0,1,1,0)     1   1   2   0     4   (2,1,1,0)
  (0,1,1,1)     1   1   3   1     6   (3,1,1,1)
  (1,0,0,0)     2   2   1   2     7   (2,2,2,1)
  (1,0,0,1)     2   2   2   3     9   (3,2,2,2)
* (1,0,1,0)     1   1   2   1     5   (2,1,1,1)
  (1,0,1,1)     1   1   3   2     7   (3,2,1,1)
* (1,1,0,0)     1   1   2   1     5   (2,1,1,1)
  (1,1,0,1)     1   1   3   2     7   (3,2,1,1)
* (1,1,1,0)     0   0   3   0     3   (3,0,0,0)
  (1,1,1,1)     0   0   4   1     5   (4,1,0,0)



With the $\Delta^{GMax}$ merging criterion, $mod(\Delta^{GMax}_\mu(\Psi)) = \{(0, 0, 1, 0), (0, 1, 0, 0)\}$,
so the decisions that best fit the group and that are allowed by the integrity
constraints are to build either the tennis-court or the private-car-park, without
increasing the rent. Whereas if one takes the decision according to the majority
wishes, then with the $\Delta^\Sigma$ operator we have $mod(\Delta^\Sigma_\mu(\Psi)) = \{(1, 1, 1, 1)\}$, and the
decision that satisfies the majority of the group is to build the three items and
to increase the rent.
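
The two results of this example can be reproduced by brute force over the 16 valuations. The following Python sketch encodes the constraint and the four knowledge bases as Boolean functions of (S, T, P, I); this encoding and the helper names are ours, while the formulas themselves are the ones given in the example.

```python
from itertools import product

W = list(product((0, 1), repeat=4))  # valuations in the order (S, T, P, I)

def models(formula):
    """Valuations that satisfy a formula given as a function of S, T, P, I."""
    return [w for w in W if formula(*w)]

def d(w, base):
    """Dalal distance from w to the closest model of a base."""
    return min(sum(x != y for x, y in zip(w, m)) for m in base)

mu = models(lambda S, T, P, I: (not ((S and T) or (S and P) or (T and P))) or I)
phi1 = phi2 = models(lambda S, T, P, I: S and T and P)
phi3 = models(lambda S, T, P, I: not (S or T or P or I))
phi4 = models(lambda S, T, P, I: T and P and not I)
psi = [phi1, phi2, phi3, phi4]

def argmin(key):
    """Models of mu that are minimal for the given key."""
    best = min(key(w) for w in mu)
    return {w for w in mu if key(w) == best}

sigma = argmin(lambda w: sum(d(w, b) for b in psi))
gmax = argmin(lambda w: sorted((d(w, b) for b in psi), reverse=True))

print(sigma)         # {(1, 1, 1, 1)}
print(sorted(gmax))  # [(0, 0, 1, 0), (0, 1, 0, 0)]
```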

This majority “vote” seems to be more “democratic” than the other method. However,
in this case it works only if $\varphi_3$ agrees to conform to the majority wishes,
which are strongly opposed to its own. But $\varphi_3$ could decide to quit the co-owners
committee, and the works will perhaps not go ahead because of a lack of money.
So if a decision, like in this example or like in a peace agreement or in a price
agreement in a competitive market, requires the approval of all the members, an
arbitration method like $\Delta^{GMax}$ seems more adequate.

5 Connections with Related Works 

5.1 Belief Revision 

We show in this section that merging operators are related to AGM belief revision
operators [1, 7, 9]. The first result is easy to prove:

Theorem 4. If $\Delta$ is an IC merging operator, then the operator $\circ$, defined as
$\varphi \circ \mu = \Delta_\mu(\varphi)$, is an AGM revision operator.
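
To make the construction of Theorem 4 concrete, here is a minimal Python sketch for one particular choice of merging operator: a distance-based operator applied to a single knowledge base, with the number of differing atoms as distance. Both choices, and the two-atom formulas, are assumptions made for the illustration.

```python
from itertools import product

W = list(product((0, 1), repeat=2))  # valuations in the order (a, b)

def revise(phi_models, mu_models):
    """mod(phi o mu): the models of mu that are closest to the models of phi."""
    def dist(w):
        return min(sum(x != y for x, y in zip(w, m)) for m in phi_models)
    best = min(dist(w) for w in mu_models)
    return {w for w in mu_models if dist(w) == best}

a_and_b = [w for w in W if w[0] and w[1]]
not_a = [w for w in W if not w[0]]
a = [w for w in W if w[0]]
b = [w for w in W if w[1]]

# Revising "a and b" by "not a" keeps b, the part that is not contradicted.
print(revise(a_and_b, not_a))  # {(0, 1)}
# When the two formulas are jointly consistent, the result is their conjunction.
print(revise(a, b))            # {(1, 1)}
```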

Conversely, we can wonder if we can build a merging operator from a given
revision operator. We propose the following definition of a merging operator from
a given revision operator $\circ$:

Definition 7.

- Consider the faithful assignment† corresponding to the revision operator $\circ$.

- Define $f_\varphi(I) = n$ where $n$ is the level where the interpretation $I$ appears in
the $\leq_\varphi$ pre-order. More formally $n$ is the length of the longest chain of strict
inequalities $I_0 <_\varphi \ldots <_\varphi I_n$ with $I_0 \models \varphi$ and $I_n = I$.

- Define $f_\Psi(I)$ from the $f_{\varphi_i}(I)$ with some given merging method (for example
$f_\Psi(I) = \sum_{\varphi_i \in \Psi} f_{\varphi_i}(I)$ if the chosen method is the $\Sigma$ method).

- Define $I \leq_\Psi J$ iff $f_\Psi(I) \leq f_\Psi(J)$.

- Finally $mod(\Delta^\circ_\mu(\Psi)) = \min(mod(\mu), \leq_\Psi)$.

† i.e. an assignment mapping each knowledge base to a pre-order satisfying conditions
1-3 of the syncretic assignment but with knowledge bases instead of knowledge sets
(cf. [9]).

The question now is to find the properties of the operator $\Delta^\circ$ defined if we choose
a particular merging method, for example if we choose the $\Sigma$ method (we get
similar results with a method à la $\Delta^{GMax}$), that is $f_\Psi(I) = \sum_{\varphi_i \in \Psi} f_{\varphi_i}(I)$. We
get the following results:

Theorem 5. If a merging operator $\Delta^\circ$ is defined from a revision operator $\circ$
and from the $\Sigma$ merging method according to definition 7, then the operator $\Delta^\circ$
satisfies (IC0-IC3), (IC5-IC8) and (Maj).

Definition 8. We define $f_\varphi(\varphi')$ by putting $f_\varphi(\varphi') = \min_{I \models \varphi'} f_\varphi(I)$

Theorem 6. If a merging operator $\Delta^\circ$ is defined from a revision operator $\circ$
using the $\Sigma$ merging method according to definition 7, then the operator $\Delta^\circ$ is an
IC majority merging operator if and only if the faithful assignment satisfies the
following “symmetry” property: $f_\varphi(\varphi') = f_{\varphi'}(\varphi)$.

We say that a revision operator $\circ$ is defined from a distance $d$ if

- $d$ is a distance, that is $d$ is a function $d : \mathcal{W} \times \mathcal{W} \to \mathbb{N}$ that satisfies $d(I, J) =
d(J, I)$ and $d(I, J) = 0$ iff $I = J$.

- Let $\varphi$ be a knowledge base and $I$ be an interpretation: $d(I, \varphi) = \min_{J \models \varphi} d(I, J)$

- $I \leq_\varphi J$ iff $d(I, \varphi) \leq d(J, \varphi)$

- $mod(\varphi \circ \mu) = \min(mod(\mu), \leq_\varphi)$

We can show that the only revision operators satisfying the symmetry
property are those defined from a distance. So as a corollary we have the following:

Theorem 7. A merging operator $\Delta^\circ$ defined from a revision operator $\circ$ and the
$\Sigma$ merging method is an IC merging operator if and only if $\circ$ is defined from a
distance.

5.2 Pure Merging 

A logical characterization of merging operators in the case where there are no
integrity constraints was proposed in [11]. We will call this case the pure merging
case.

Definition 9. Let $\Delta$ be an operator mapping a knowledge set $\Psi$ to a knowledge
base $\Delta(\Psi)$. $\Delta$ is a pure merging operator if and only if it satisfies the following
postulates:

(A1) $\Delta(\Psi)$ is consistent
(A2) If $\Psi$ is consistent, then $\Delta(\Psi) = \bigwedge\Psi$
(A3) If $\Psi_1 \leftrightarrow \Psi_2$, then $\Delta(\Psi_1) \leftrightarrow \Delta(\Psi_2)$
(A4) If $\varphi \wedge \varphi'$ is not consistent, then $\Delta(\varphi \sqcup \varphi') \nvdash \varphi$
(A5) $\Delta(\Psi_1) \wedge \Delta(\Psi_2) \vdash \Delta(\Psi_1 \sqcup \Psi_2)$
(A6) If $\Delta(\Psi_1) \wedge \Delta(\Psi_2)$ is consistent, then $\Delta(\Psi_1 \sqcup \Psi_2) \vdash \Delta(\Psi_1) \wedge \Delta(\Psi_2)$

A pure merging operator is a pure majority operator if it satisfies (M7):
(M7) $\forall \varphi \;\; \exists n \;\; \Delta(\Psi \sqcup \varphi^n) \vdash \varphi$

A pure merging operator is a pure arbitration operator if it satisfies (A7):
(A7) $\forall \varphi' \;\; \exists \varphi \;\; \varphi' \nvdash \varphi \;\; \forall n \;\; \Delta(\varphi' \sqcup \varphi^n) = \Delta(\varphi' \sqcup \varphi)$

First it is easy to see that the postulates obtained from the (ICi) ones when $\mu =
\top$ are nearly the same as those given in [11]. The main differences are that
postulate (IC4) is stronger than (A4) and that postulate (Maj) is stronger than
(M7). Notice also that postulate (Arb) is not expressible when $\mu = \top$. So there is
no direct relationship between arbitration in the sense of [11] and IC arbitration.
Notice that (A7) expresses only a kind of non-majority rule and thus is not a
direct characterization of arbitration, whereas (Arb) defines in a more positive
manner the arbitration behaviour.

Theorem 8. If $\Delta$ is an IC merging operator, then $\Delta_\top$ is a pure merging
operator (i.e. it satisfies (A1-A6)). Furthermore if $\Delta$ is an IC majority merging
operator, then $\Delta_\top$ is a pure majority merging operator.

5.3 Liberatore and Schaerf Commutative Revision

This section addresses the links between merging operators and those defined by
Liberatore and Schaerf. The postulates given by Liberatore and Schaerf [13, 14]
for commutative revision are the following:

(LS1) $\varphi \diamond \mu \leftrightarrow \mu \diamond \varphi$
(LS2) $\varphi \wedge \mu$ implies $\varphi \diamond \mu$
(LS3) If $\varphi \wedge \mu$ is satisfiable then $\varphi \diamond \mu$ implies $\varphi \wedge \mu$
(LS4) $\varphi \diamond \mu$ is unsatisfiable iff both $\varphi$ and $\mu$ are unsatisfiable
(LS5) If $\varphi_1 \leftrightarrow \varphi_2$ and $\mu_1 \leftrightarrow \mu_2$ then $\varphi_1 \diamond \mu_1 \leftrightarrow \varphi_2 \diamond \mu_2$
(LS6) $\varphi \diamond (\mu \vee \theta) = \begin{cases} \varphi \diamond \mu & \text{or} \\ \varphi \diamond \theta & \text{or} \\ (\varphi \diamond \mu) \vee (\varphi \diamond \theta) \end{cases}$
(LS7) $\varphi \diamond \mu$ implies $\varphi \vee \mu$
(LS8) If $\varphi$ is satisfiable then $\varphi \wedge (\varphi \diamond \mu)$ is also satisfiable

This definition of commutative revision operators is very close to the one of
belief revision operators. But it suffers from two drawbacks from a merging point of
view. First, it allows only two knowledge bases to be merged. Second, it forces the result
to be in the disjunction of the two given knowledge bases. We argue in [11, 12]
that this does not always have to be the case.

Definition 10. If $\Delta$ is an IC merging operator we define a commutative revision
operator $\diamond_\Delta$ by $\varphi \diamond_\Delta \mu = \Delta_{\varphi \vee \mu}(\varphi \sqcup \mu)$. We will say that $\diamond_\Delta$ is the commutative
revision operator associated with $\Delta$.

Theorem 9. If $\Delta$ is an IC merging operator, then the operator $\diamond_\Delta$ associated
with it satisfies (LS1-LS5), (LS7) and (LS8).

By definition $\diamond_\Delta$ operators are commutative, but the following property
shows that they can be considered as “double revision operators” (we recall that
$\varphi \circ \mu = \Delta_\mu(\varphi)$ is an AGM revision operator).

Theorem 10. If $\Delta$ is an IC merging operator then it satisfies

$\Delta_{\varphi \vee \mu}(\varphi \sqcup \mu) = \Delta_\varphi(\mu) \vee \Delta_\mu(\varphi)$

In order to obtain systematically a commutative revision operator from an
IC merging operator using definition 10, IC merging operators need to satisfy
an additional property:

$\Delta_\varphi(\mu \vee \theta) = \begin{cases} \Delta_\varphi(\mu) & \text{if } \Delta_{\mu \vee \theta}(\varphi) \vdash \neg\theta \\ \Delta_\varphi(\theta) & \text{if } \Delta_{\mu \vee \theta}(\varphi) \vdash \neg\mu \\ \Delta_\varphi(\mu) \vee \Delta_\varphi(\theta) & \text{otherwise} \end{cases}$   (1)

Theorem 11. If $\Delta$ is an IC merging operator, then the operator $\diamond_\Delta$ defined as
$\varphi \diamond_\Delta \mu = \Delta_{\varphi \vee \mu}(\varphi \sqcup \mu)$ satisfies (LS1-LS8) if and only if $\Delta$ satisfies property (1).

Remark 1. Property (1) implies $\Delta_\varphi(\mu) = \Delta_\varphi(\Delta_\mu(\varphi))$

This remark shows that property (1) is quite a topological one since $\Delta_\varphi(\mu) =
\mu \circ \varphi = (\varphi \circ \mu) \circ \varphi$. That is to say that the result of the revision of $\mu$ by $\varphi$ depends
only on the models of $\mu$ that are the closest to $\varphi$. Revisions defined from a distance
satisfy this property.

A serious drawback of the commutative revision definition is that it does not
allow more than two knowledge bases to be merged since it is not associative (see
[13, 14]), but the idea that the result of the merging has to imply the disjunction
of the knowledge bases can be very useful in a lot of applications. IC merging
operators make it possible to generalize Liberatore and Schaerf operators to $n$ knowledge
bases, by defining the merging of a knowledge set $\varphi_1 \sqcup \ldots \sqcup \varphi_n$ as:

$\Delta_{\varphi_1 \vee \ldots \vee \varphi_n}(\varphi_1 \sqcup \ldots \sqcup \varphi_n)$

The logical properties of these operators are worth more study.
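
The effect of using the disjunction of the bases as integrity constraint can be seen on a small example. The Python sketch below assumes a sum-of-distances operator with the number of differing atoms as distance, and represents each base by its set of models; the function name `ls_merge` and the two bases are hypothetical.

```python
from itertools import product

W = list(product((0, 1), repeat=3))  # all interpretations over three atoms

def d(w, base):
    """Distance from w to the closest model of a base (differing atoms)."""
    return min(sum(x != y for x, y in zip(w, m)) for m in base)

def sigma_merge(bases, mu_models):
    """Models of mu that minimise the sum of the distances to the bases."""
    def score(w):
        return sum(d(w, b) for b in bases)
    best = min(score(w) for w in mu_models)
    return {w for w in mu_models if score(w) == best}

def ls_merge(bases):
    """Merge under the constraint phi_1 v ... v phi_n (union of the models)."""
    disjunction = sorted(set().union(*bases))
    return sigma_merge(bases, disjunction)

bases = [{(1, 1, 1)}, {(0, 0, 0)}]
print(len(sigma_merge(bases, W)))  # 8: without constraint, every valuation ties
print(sorted(ls_merge(bases)))     # [(0, 0, 0), (1, 1, 1)]
```

Without the constraint all eight interpretations are equally good in this example; with it, the result implies the disjunction of the two bases.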

6 Conclusion 

In this paper we have presented a logical framework for knowledge base merging 
in the presence of integrity constraints when there is no preference over the 
knowledge bases. We stated a set of properties an IC merging operator should 
satisfy in order to have a rational behaviour. This set of properties can then be 
used to classify particular merging methods. 

We made a distinction between arbitration and majority operators, arbi- 
tration operators striving to minimise individual dissatisfaction and majority 
operators trying to minimise global dissatisfaction. An open question is to know 
if arbitration and majority merging are two distinct merging subclasses or if 
it is possible for a merging operator to be both an arbitration and a majority 
merging operator. 

We provide a model-theoretic characterisation for IC merging operators. This 
characterisation is much more natural than the one in [11], due to the presence 
of integrity constraints. 

Actually, in a committee, the protagonists do not all have the same weight
in the final decision, so one generally needs to weight each knowledge base to
reflect this. The idea behind weights is that the higher the weight of a
knowledge base, the more important it is. If the knowledge bases reflect the
views of several people, weights could represent, for example, the cardinality
of each group. We want to characterize logically the use of these weights.
Majority operators are close to this idea of weighted operators since they
allow cardinalities to be taken into account. But a more subtle treatment of
weights in merging remains to be done.

Ongoing work is the study of merging operators that adopt a coherence
approach to theory merging. These operators are based on a union of all the
knowledge bases and on the selection of some maximal subsets with respect to a
given order (not necessarily inclusion), see e.g. [3,4]. An important drawback
of coherence merging operators is that the source of each piece of knowledge is
lost in the merging process. So the problem is to take into account the source
of each piece of information in order to allow subtler behaviours for merging
operators, for example to define majority or arbitration operators.
Abstract. Introducing the notion of Boolean-valued Sugeno integral
and applying it to a particular Boolean algebra defined over the set of 
special binary matrices, and defining a mapping which takes these ma- 
trices into real numbers from the unit interval, we can prove that the 
classical integral of a function taking a finite probability space into the 
unit interval can be defined by the value which the mapping in ques- 
tion ascribes to the corresponding value of the Boolean-valued Sugeno 
integral. 



1 Introduction 

It was the problems connected with the practical applicability of algorithms
for decision making under uncertainty in various knowledge-based and expert
systems that motivated the search for alternative mathematical models and tools
for uncertainty quantification and processing, perhaps less powerful than those
offered by classical probability theory and mathematical statistics, but better
fitted and adapted to the poor level of knowledge concerning the investigated
system and its environment. Possibilistic measures serve as good examples of
such alternative approaches.

On the other hand, however, when developing an alternative mathematical
model for uncertainty processing and quantification, it seems quite reasonable
and legitimate to try to introduce analogies of the notions and tools which
proved themselves to be useful in probability theory, supposing, of course,
that these notions can be defined in a modified way compatible with the
framework of the alternative model under investigation. In particular, we shall
attempt, below, to embed an appropriate analogy of the notion of expected
value, an elementary one in probability theory, into the framework of
possibility theory. To achieve this goal, we shall use the notion of Sugeno
integral, introduced and developed in the theory of fuzzy sets as a numerical
characteristic or degree of fuzziness "contained" in a particular fuzzy set.

2 Real-Valued Possibilistic Measures and Sugeno 
Integrals 

Definition 2.1. Let $\Omega$ be a nonempty set, let $\mathcal{P}(\Omega)$ denote the set of all
subsets of $\Omega$, also denoted by $2^{\Omega}$. A mapping
$\mu : \mathcal{P}(\Omega) \to \langle 0,1 \rangle$ ascribing to each subset $A$ of $\Omega$ a real
number $\mu(A)$ from the unit interval $\langle 0,1 \rangle$, also denoted by $I$, is called
a possibilistic measure on $\Omega$, if $\mu(\Omega) = 1$, $\mu(\emptyset) = 0$, and
$\mu(A \cup B) = \mu(A) \vee \mu(B)$ for all $A, B \subset \Omega$. Here $\emptyset$ denotes the
empty subset of $\Omega$ and $\vee$ ($\wedge$, resp.) denotes the standard operation of
supremum (infimum, resp.) in the unit interval $I$ of real numbers. A
possibilistic measure $\mu$ on $\Omega$ is called distributive, if
$\mu(A) = \bigvee_{\omega \in A} \mu(\{\omega\})$ holds for each $\emptyset \neq A \subset \Omega$, where
$\bigvee_{\omega \in A}$ denotes the standard operation of supremum over subsets of
$\Omega$ (and $\bigwedge_{\omega \in A}$ denotes the corresponding dual operation of
infimum) and $\{\omega\}$ denotes the singleton generated by $\omega \in \Omega$.
The mapping $\pi_\mu : \Omega \to I$ defined by $\pi_\mu(\omega) = \mu(\{\omega\})$ is called the
possibilistic distribution induced by $\mu$. Dually, given a mapping
$\pi : \Omega \to I$, the mapping $\mu_\pi : \mathcal{P}(\Omega) \to I$ defined by
$\mu_\pi(A) = \bigvee_{\omega \in A} \pi(\omega)$ for all $\emptyset \neq A \subset \Omega$,
$\mu_\pi(\emptyset) = 0$, is called the possibilistic measure induced by $\pi$ supposing
that $\bigvee_{\omega \in \Omega} \pi(\omega) = 1$; in this case obviously
$\mu_\pi(\{\omega\}) = \pi(\omega)$. □

(i) Every possibilistic measure defined on a finite space $\Omega$ is evidently
distributive, but this is not the case when $\Omega$ is infinite. Indeed, put
$\mu(A) = 0$ for each finite subset $A$ of $\Omega$ including the empty one, and put
$\mu(A) = 1$ for each infinite $A \subset \Omega$.

(ii) Every possibilistic measure is monotone with respect to set-theoretic
inclusion, in symbols, $\mu(A) \leq \mu(B)$ holds for each $A \subset B \subset \Omega$. Indeed,
$A \subset B$ implies that $B = A \cup (B - A)$, so that
$\mu(B) = \mu(A) \vee \mu(B - A) \geq \mu(A)$. As $A \cap B \subset A$ and $A \cap B \subset B$ hold
for all $A, B \subset \Omega$, we obtain the inequality
$\mu(A \cap B) \leq \mu(A) \wedge \mu(B)$, which cannot be, in general, replaced by
equality. Indeed, take $\mu : \mathcal{P}(\Omega) \to I$ such that $\mu(\emptyset) = 0$ and
$\mu(A) = 1$ for all $\emptyset \neq A \subset \Omega$ (this possibilistic measure is induced by
the constant possibilistic distribution $\pi(\omega) = 1$ for each $\omega \in \Omega$). For
$\emptyset \neq A, B \subset \Omega$ such that $A \cap B = \emptyset$ we obtain that
$\mu(A \cap B) = 0 < 1 = \mu(A) \wedge \mu(B)$.
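
The remarks above can be checked mechanically on a small finite space. The following is a minimal sketch, assuming a three-element space with hypothetical labels "a", "b", "c" and the constant distribution of remark (ii); the helper names `induced_measure` and `all_subsets` are illustrative and do not come from the paper.

```python
from itertools import combinations

def induced_measure(pi):
    """Possibilistic measure induced by the distribution pi (a dict)."""
    def mu(A):
        # supremum of pi over A; the empty set gets the value 0
        return max((pi[w] for w in A), default=0.0)
    return mu

def all_subsets(omega):
    items = sorted(omega)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

pi = {"a": 1.0, "b": 1.0, "c": 1.0}   # constant distribution of remark (ii)
mu = induced_measure(pi)
subsets = all_subsets(pi)

# mu(A ∪ B) = mu(A) ∨ mu(B) for every pair of subsets
maxitive = all(mu(A | B) == max(mu(A), mu(B)) for A in subsets for B in subsets)
# monotonicity with respect to inclusion
monotone = all(mu(A) <= mu(B) for A in subsets for B in subsets if A <= B)
# two disjoint nonempty sets: mu(A ∩ B) is strictly below mu(A) ∧ mu(B)
A, B = frozenset({"a"}), frozenset({"b", "c"})
gap = (mu(A & B), min(mu(A), mu(B)))
print(maxitive, monotone, gap)
```

The pair `gap` holds the two sides of the strict inequality from remark (ii).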

Definition 2.2. Let $\Omega$ be a nonempty set, let $\mu$ be a possibilistic measure on
$\Omega$, let $f$ be a mapping taking $\Omega$ into $I$. Then the Sugeno integral over
$\Omega$ of the function $f$ with respect to the possibilistic measure $\mu$ is
defined by

$\oint_{\Omega} f \, d\mu = \bigvee_{\alpha \in I} \left[ \alpha \wedge \mu(\{\omega \in \Omega : f(\omega) \geq \alpha\}) \right]$. (1)

□

The symbol $\Omega$ under the integration symbol $\oint$ will be omitted when no
misunderstanding can arise.

Theorem 2.1. Let $\mu$ be a possibilistic measure on a nonempty space $\Omega$, let
$f, g : \Omega \to I$ be functions, let $(f \otimes g)(\omega)$ denote
$f(\omega) \otimes g(\omega)$ for all $\omega \in \Omega$ and for $\otimes := \cdot$ (standard
product), $\vee$, $\wedge$ and $\oplus$, where $x \oplus y = (x + y) \wedge 1$ for all
$x, y \in I$, hence, $\otimes$ is a meta-symbol, not a particular binary operation.
Let $f \leq g$ mean that $f(\omega) \leq g(\omega)$ holds for each $\omega \in \Omega$. Then

(ii) $\oint (f \cdot g) \, d\mu \leq \oint (f \wedge g) \, d\mu \leq \left( \oint f \, d\mu \right) \wedge \left( \oint g \, d\mu \right) \leq$

$\leq \left( \oint f \, d\mu \right) \vee \left( \oint g \, d\mu \right) = \oint (f \vee g) \, d\mu \leq \oint (f \oplus g) \, d\mu$. □

Proof. A matter of technical routine. □

Theorem 2.2. Let $\mu$ be a distributive possibilistic measure on a nonempty space
$\Omega$, let $f : \Omega \to I$ be a function. Then

$\oint f \, d\mu = \bigvee_{\omega \in \Omega} \left[ f(\omega) \wedge \mu(\{\omega\}) \right]$. (2)

□



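Proof. Take $x \in I$ such that
$x < \oint f \, d\mu =_{df} \bigvee_{\alpha \in I} [\alpha \wedge \mu(\{\omega \in \Omega : f(\omega) \geq \alpha\})]$
holds. Then there exists $\alpha \in I$ such that
$\alpha \wedge \mu(\{\omega \in \Omega : f(\omega) \geq \alpha\}) > x$ holds, consequently,
$\mu(\{\omega \in \Omega : f(\omega) \geq \alpha\}) > x$ and $\alpha > x$ hold as well. Hence, for
each $y < x$ there exists $\omega_y \in \Omega$ such that $f(\omega_y) \geq \alpha$ and
$\mu(\{\omega_y\}) > y$ hold, as $\mu$ is supposed to be distributive. It follows that
$f(\omega_y) \wedge \mu(\{\omega_y\}) \geq \alpha \wedge y = y$, as $\alpha > x > y$. As this
inequality holds for each $y \in I$, $y < x$, we obtain that
$\bigvee_{\omega \in \Omega} [f(\omega) \wedge \mu(\{\omega\})] \geq x$ holds. Consequently,

$\bigvee_{\omega \in \Omega} [f(\omega) \wedge \mu(\{\omega\})] \geq \bigvee \left\{ x \in I : x < \oint f \, d\mu \right\} = \oint f \, d\mu$. (3)

To arrive at a contradiction, suppose that
$\bigvee_{\omega \in \Omega} [f(\omega) \wedge \mu(\{\omega\})] > \oint f \, d\mu$
holds. Then there exists $\omega_0 \in \Omega$ such that

$f(\omega_0) \wedge \mu(\{\omega_0\}) > \oint f \, d\mu = \bigvee_{\alpha \in I} [\alpha \wedge \mu(\{\omega \in \Omega : f(\omega) \geq \alpha\})]$ (4)

holds. Setting $\alpha = f(\omega_0)$ we obtain that
$\mu(\{\omega \in \Omega : f(\omega) \geq \alpha\}) \geq \mu(\{\omega_0\})$ holds. Consequently,

$\oint f \, d\mu = \bigvee_{\alpha \in I} [\alpha \wedge \mu(\{\omega \in \Omega : f(\omega) \geq \alpha\})] \geq$ (5)

$\geq \bigvee_{\alpha = f(\omega_0)} [\alpha \wedge \mu(\{\omega \in \Omega : f(\omega) \geq \alpha\})] =$

$= f(\omega_0) \wedge \mu(\{\omega \in \Omega : f(\omega) \geq f(\omega_0)\}) \geq f(\omega_0) \wedge \mu(\{\omega_0\})$,

but this result contradicts the inequality (4). So, the equality in (3) holds and
the assertion is proved. □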
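
Theorem 2.2 can be illustrated numerically on a finite space, where every possibilistic measure is distributive. The sketch below is only an illustration under stated assumptions: the space, the distribution `pi`, and the function `f` are made-up values, and the supremum in (1) is taken over the values of `f` only, which suffices on a finite space because the level-set measure is a step function of the threshold.

```python
from fractions import Fraction as Fr

def possibility(pi, A):
    """mu(A) as the maximum of the distribution pi over A (0 on the empty set)."""
    return max((pi[w] for w in A), default=Fr(0))

def sugeno_by_definition(f, pi):
    """Formula (1), with the supremum restricted to the values taken by f."""
    best = Fr(0)
    for alpha in set(f.values()):
        level_set = [w for w in f if f[w] >= alpha]
        best = max(best, min(alpha, possibility(pi, level_set)))
    return best

def sugeno_by_theorem(f, pi):
    """Formula (2): maximum over points of f(w) ∧ mu({w})."""
    return max(min(f[w], pi[w]) for w in f)

# hypothetical three-point example
pi = {"w1": Fr(1), "w2": Fr(1, 2), "w3": Fr(1, 4)}
f = {"w1": Fr(1, 5), "w2": Fr(3, 5), "w3": Fr(9, 10)}
print(sugeno_by_definition(f, pi), sugeno_by_theorem(f, pi))
```

Both functions return the same value on this example, as the theorem predicts.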

3 Boolean Algebras — Preliminaries and Some Cases 
of Particular Interest 



There are many equivalent definitions of the notion of Boolean algebra starting
from various lists of primitive notions and various sets of axioms. For the sake
of definiteness let us introduce the definition from [4].

Definition 3.1. Boolean algebra $\mathcal{B}$ is a quadruple $\langle B, \vee, \wedge, \neg \rangle$, where

(i) $B$ is a nonempty set called the support (set) of $\mathcal{B}$,

(ii) $\vee$ ($\wedge$, resp.) is a total binary operation called supremum (infimum,
resp.) which takes $B \times B$ into $B$, and

(iii) $\neg$ is a total unary operation called complement, which takes $B$ into
$B$,

such that the following axioms hold true for all $x, y, z \in B$:

(A1) $x \vee y = y \vee x$, $x \wedge y = y \wedge x$,

(A2) $x \vee (y \vee z) = (x \vee y) \vee z$, $x \wedge (y \wedge z) = (x \wedge y) \wedge z$,

(A3) $x \vee (y \wedge z) = (x \vee y) \wedge (x \vee z)$, $x \wedge (y \vee z) = (x \wedge y) \vee (x \wedge z)$,

(A4) $x \wedge (x \vee y) = x$, $x \vee (x \wedge y) = x$,

(A5) $(x \wedge (\neg x)) \vee y = y$, $(x \vee (\neg x)) \wedge y = y$. □

It follows easily from axioms (A1)-(A5) that setting $0_{\mathcal{B}} = x \wedge (\neg x)$ and
$1_{\mathcal{B}} = x \vee (\neg x)$, for an $x \in B$, both $0_{\mathcal{B}}$ (the zero element of the
Boolean algebra $\mathcal{B}$) and $1_{\mathcal{B}}$ (the unit element of $\mathcal{B}$) are defined
uniquely no matter which $x \in B$ may be taken. In what follows, we shall always
suppose that card $B \geq 2$ holds. The binary relation $\leq_{\mathcal{B}}$ on $B$ defined by
$x \leq_{\mathcal{B}} y$ iff $x \wedge y = x$ (or, what turns out to be the same, iff
$x \vee y = y$), $x, y \in B$, can be easily proved to possess all the properties
of a partial ordering on $B$ such that $\vee$ ($\wedge$, resp.) is just the supremum
(infimum, resp.) operation generated by $\leq_{\mathcal{B}}$ in $B$. The indices
$\mathcal{B}$ in symbols like $0_{\mathcal{B}}$, $1_{\mathcal{B}}$, etc. will be omitted when no
misunderstanding can arise. $x <_{\mathcal{B}} y$ means that $x \leq_{\mathcal{B}} y$ and
$x \neq y$ hold simultaneously.

Remark. We are aware of the ambiguity that threatens because we use for the
supremum and infimum operations over Boolean algebras the same symbols as for
the standard supremum and infimum operations on the unit interval of real
numbers. Let us hope that this danger will be avoided, or at least minimized,
by the fact that from now on until the end of this paper the standard
operations on $\langle 0,1 \rangle$ introduced above will not be used.

Let us denote by $\mathcal{N}^+ = \{1, 2, \dots\}$ the set of all positive integers. Given
$n \in \mathcal{N}^+$, let $[n]$ denote the subset $\{1, 2, \dots, n\} \subset \mathcal{N}^+$ and let
$B_n = \{0,1\}^n$ denote the space of all binary n-tuples. For each
$n \in \mathcal{N}^+$ we define three Boolean algebras $\mathcal{B}_n^0$, $\mathcal{B}_n^1$ and
$\mathcal{B}_n^2$ as follows.

(i) $\mathcal{B}_n^0 = \langle \mathcal{P}([n]), \cup, \cap, [n] - \cdot \rangle$, hence, $\mathcal{B}_n^0$ is the Boolean
algebra of all subsets of the set $[n]$ of positive integers with respect to the
usual set-theoretic operations of union, intersection and complement.

(ii) $\mathcal{B}_n^1 = \langle B_n, \vee_1, \wedge_1, 1^n - \cdot \rangle$, where
$(x_1, \dots, x_n) \vee_1 (y_1, \dots, y_n) =_{df} (x_1 \vee y_1, x_2 \vee y_2, \dots, x_n \vee y_n)$ and
$(x_1, \dots, x_n) \wedge_1 (y_1, \dots, y_n) =_{df} (x_1 \wedge y_1, x_2 \wedge y_2, \dots, x_n \wedge y_n)$
for each $x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n) \in \{0,1\}^n$. So, the i-th
component of $x \vee_1 y$ is 1 iff at least one of the i-th components of $x$ and
$y$ is 1, and the i-th component of $x \wedge_1 y$ is 1 iff both the i-th components
of $x$ and $y$ are 1. Finally, $1^n - x = (1 - x_1, 1 - x_2, \dots, 1 - x_n)$, so that
$1^n = (1, 1, \dots, 1)$ is the unit element and $0^n = (0, 0, \dots, 0)$ is the zero
element of $\mathcal{B}_n^1$. Evidently, both $\mathcal{B}_n^0$ and $\mathcal{B}_n^1$ are isomorphic,
their isomorphism being defined by the mapping
$\chi : \mathcal{P}([n]) \to \{0,1\}^n$ ascribing to each $A \subset [n]$ the n-tuple
$(\chi(A)_1, \dots, \chi(A)_n)$ such that $\chi(A)_i = 1$, if $i \in A$, $\chi(A)_i = 0$
otherwise.

Using this mapping, the operations $\vee_1$, $\wedge_1$ and $1^n - \cdot$ can be defined by

$x \vee_1 y = \chi(\chi^{-1}(x) \cup \chi^{-1}(y))$, (6)

$x \wedge_1 y = \chi(\chi^{-1}(x) \cap \chi^{-1}(y))$,

$1^n - x = \chi([n] - \chi^{-1}(x))$,

for every $x, y \in \{0,1\}^n$.

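(iii) Given $n \in \mathcal{N}^+$ and $x = (x_1, \dots, x_n) \in \{0,1\}^n$, set

$\varphi(x) = (1 - 2^{-n})^{-1} \sum_{i=1}^{n} x_i 2^{-i}$. (7)

Evidently, $\varphi(x) \in I = \langle 0,1 \rangle$ for all $x \in \{0,1\}^n$, $\varphi(0^n) = 0$ and
$\varphi(1^n) = 1$. Denoting by
$D_n = \{\alpha \in I : \alpha = \varphi(x) \text{ for some } x \in \{0,1\}^n\}$ the set of possible
values taken by the mapping $\varphi$, we can easily see that $\varphi$ defines a 1-1
mapping between $\{0,1\}^n$ and $D_n$. So, given $\alpha, \beta \in D_n$, we may define
operations $\vee_2$, $\wedge_2$ and $1 \ominus \cdot$ in this way:

$\alpha \vee_2 \beta = \varphi(\varphi^{-1}(\alpha) \vee_1 \varphi^{-1}(\beta))$, (8)

$\alpha \wedge_2 \beta = \varphi(\varphi^{-1}(\alpha) \wedge_1 \varphi^{-1}(\beta))$,

$1 \ominus \alpha = \varphi(1^n - \varphi^{-1}(\alpha))$,

as $\varphi^{-1}(1) = 1^n$ obviously holds. Evidently, the operations $\vee_1$, $\wedge_1$,
$\vee_2$, $\wedge_2$, $1^n - \cdot$, $1 \ominus \cdot$, as well as the mappings $\chi$ and
$\varphi$ depend also on the parameter $n$, but in order to simplify our notation we
shall not introduce this dependence explicitly unless a misunderstanding could
arise. The quadruple $\mathcal{B}_n^2 = \langle D_n, \vee_2, \wedge_2, 1 \ominus \cdot \rangle$ is
obviously a Boolean algebra isomorphic to $\mathcal{B}_n^0$ and $\mathcal{B}_n^1$.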

Two properties of $\mathcal{B}_n^2$ are perhaps worth being mentioned explicitly.

(a) For each $n \in \mathcal{N}^+$, $1 \ominus \alpha = 1 - \alpha$ for all $\alpha \in D_n$.

(b) For each $\varepsilon > 0$ there exists $n \in \mathcal{N}^+$ such that
$|\varphi(x) - h(x)| < \varepsilon$ holds uniformly for all $x \in \{0,1\}^n$, here
$h(x) = \sum_{i=1}^{n} x_i 2^{-i}$ is the real number in $I$ defined when taking $x$
as its binary decomposition and setting $x_i = 0$ for all $i > n$. All we need
is to take $n$ such that $2^{-n} < \varepsilon$ holds, so that
$n > \lg_2(1/\varepsilon)$ will do. The reason for which we shall use $\varphi(x)$
instead of $h(x)$ is to ensure that $1 \in D_n$, i.e., that $\varphi(1^n) = 1$. On the
other hand, when using infinite binary sequences from $\{0,1\}^\infty$ and setting
$\varphi(x) = \sum_{i=1}^{\infty} x_i 2^{-i}$ for $x = (x_1, x_2, \dots) \in \{0,1\}^\infty$ as
usual, we would arrive at serious obstacles in what follows, caused by the
ambiguity of infinite binary decompositions of some real numbers from $I$
($(1, 0, 0, \dots) \leftrightarrow (0, 1, 1, \dots)$).
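
The mapping (7) and the operations (8) are easy to experiment with. The following is a small sketch using exact rational arithmetic, assuming n = 4; the function names `h`, `phi`, `join2`, `meet2`, and `complement2` are illustrative labels for the objects defined in the text, not identifiers from the paper.

```python
from fractions import Fraction as Fr
from itertools import product

def h(x):
    """Truncated binary expansion: sum of x_i * 2^-i."""
    return sum(Fr(xi, 2 ** i) for i, xi in enumerate(x, start=1))

def phi(x):
    """Mapping (7): h(x) rescaled so that phi(1^n) = 1."""
    return h(x) / (1 - Fr(1, 2 ** len(x)))

n = 4
inverse = {phi(x): x for x in product((0, 1), repeat=n)}  # phi^-1 on D_n
D_n = sorted(inverse)

def join2(a, b):
    """alpha ∨_2 beta, computed componentwise on the binary tuples."""
    return phi(tuple(p | q for p, q in zip(inverse[a], inverse[b])))

def meet2(a, b):
    """alpha ∧_2 beta, computed componentwise on the binary tuples."""
    return phi(tuple(p & q for p, q in zip(inverse[a], inverse[b])))

def complement2(a):
    """1 ⊖ alpha, computed as phi(1^n - phi^-1(alpha))."""
    return phi(tuple(1 - p for p in inverse[a]))

# property (a): the Boolean complement coincides with 1 - alpha
print(all(complement2(a) == 1 - a for a in D_n))
# property (b): phi and h differ by at most 2^-n
print(max(abs(phi(x) - h(x)) for x in inverse.values()))
```

For n = 4 the set `D_n` consists of the sixteen multiples of 1/15 in the unit interval, which makes property (a) plausible at a glance.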
Let $\psi : \mathcal{P}([n]) \to D_n$ be the 1-1 mapping defined by the composition of the
mappings $\chi$ and $\varphi$, namely

$\psi(A) = \varphi(\chi(A)) = (1 - 2^{-n})^{-1} \sum_{i=1}^{n} \chi(A)_i 2^{-i}$ (9)

for each $A \subset [n] = \{1, 2, \dots, n\} \subset \mathcal{N}^+$. Let us define a partial operation
of addition in $D_n$ in this way: Let $\alpha_1, \alpha_2, \dots, \alpha_m$ be a sequence of
numbers from $D_n$ such that the subsets $\psi^{-1}(\alpha_i) \subset [n]$ are mutually
disjoint (it follows immediately that, if all $\alpha_i$'s are nonzero, $m \leq n$
must hold). Then $\boxplus_{i=1}^{m} \alpha_i =_{df} \bigvee_{2,\, i=1}^{m} \alpha_i$
($= \alpha_1 \vee_2 \alpha_2 \vee_2 \alpha_3 \vee_2 \dots \vee_2 \alpha_m$). If the subsets
$\psi^{-1}(\alpha_i)$ are not mutually disjoint, $\boxplus_{i=1}^{m} \alpha_i$ is not defined.
It follows immediately that, if defined,
$\boxplus_{i=1}^{m} \alpha_i = \alpha_1 + \alpha_2 + \dots + \alpha_m$, where $+$ denotes the
standard addition in $I$.
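
A short sketch of the partial addition may help: it is defined only when the encoded subsets are pairwise disjoint, and then it agrees with ordinary addition. The names `psi` and `partial_sum` are hypothetical, and returning `None` for the undefined case is a design choice of this illustration only.

```python
from fractions import Fraction as Fr

def psi(A, n):
    """Mapping (9): encode a subset A of {1, ..., n} as a number in D_n."""
    return sum(Fr(1, 2 ** i) for i in A) / (1 - Fr(1, 2 ** n))

def partial_sum(subsets, n):
    """Join of the psi-values of pairwise disjoint subsets, else None."""
    seen = set()
    for A in subsets:
        if seen & set(A):
            return None          # not mutually disjoint: sum undefined
        seen |= set(A)
    return psi(seen, n)          # psi of the union, i.e. the join in D_n

n = 4
defined = partial_sum([{1}, {3, 4}], n)
undefined = partial_sum([{1, 2}, {2, 3}], n)
print(defined, psi({1}, n) + psi({3, 4}, n), undefined)
```

The first two printed values coincide, which is the additivity claim made at the end of the paragraph above.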

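A Boolean algebra $\mathcal{B} = \langle B, \vee_{\mathcal{B}}, \wedge_{\mathcal{B}}, \neg_{\mathcal{B}} \rangle$ is called complete,
if for all $\emptyset \neq D \subset B$ the supremum $\bigvee_{x \in D} x$ (abbreviated to
$\bigvee_{\mathcal{B}} D$) and the infimum $\bigwedge_{x \in D} x$ (abbreviated to
$\bigwedge_{\mathcal{B}} D$) are defined. E.g., all the Boolean algebras $\mathcal{B}_n^0$,
$\mathcal{B}_n^1$, $\mathcal{B}_n^2$ introduced above are complete.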

4 Boolean-Valued Sugeno Integrals 

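Definition 4.1. Let $\Omega$ be a nonempty set, let $\mathcal{B} = \langle B, \vee, \wedge, \neg \rangle$ be a
complete Boolean algebra. $\mathcal{B}$-valued possibilistic measure on $\Omega$ is a mapping
$\mu : \mathcal{P}(\Omega) \to B$ such that $\mu(\emptyset) = 0_{\mathcal{B}}$, $\mu(\Omega) = 1_{\mathcal{B}}$, and
$\mu(E \cup F) = \mu(E) \vee \mu(F)$ for all $E, F \subset \Omega$. The possibilistic measure
$\mu$ is called distributive, if $\mu(A) = \bigvee_{\omega \in A} \mu(\{\omega\})$ for all
$\emptyset \neq A \subset \Omega$. □

The identity mapping $\iota : \mathcal{P}(\Omega) \to \mathcal{P}(\Omega)$, i.e., $\iota(E) = E$ for each
$E \subset \Omega$, is also a (trivial) Boolean-valued possibilistic measure taking its
values in the Boolean algebra $\langle \mathcal{P}(\Omega), \cup, \cap, \Omega - \cdot \rangle$ of all subsets
of $\Omega$.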

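Definition 4.2. Let $\Omega$ and $\mathcal{B}$ be as in Definition 4.1, let
$\mu : \mathcal{P}(\Omega) \to B$ be a $\mathcal{B}$-valued possibilistic measure on $\Omega$, let
$f : \Omega \to B$ be a total mapping (function) ascribing a value $f(\omega) \in B$ to
each $\omega \in \Omega$. Then the $\mathcal{B}$-valued Sugeno integral ($\mathcal{B}$-S. integral, for
short) of the function $f$ over the space $\Omega$ and with respect to the
$\mathcal{B}$-valued possibilistic measure $\mu$ is defined by

$\oint_{\Omega} f \, d\mu = \bigvee_{\alpha \in B} \left[ \alpha \wedge \mu(\{\omega \in \Omega : f(\omega) \geq_{\mathcal{B}} \alpha\}) \right]$. (10)

□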

As in the case of real-valued possibilistic measures, if $\Omega$ is finite, every
$\mathcal{B}$-valued possibilistic measure on $\Omega$ is distributive by definition, but if
$\Omega$ is infinite, this need not be the case in general. Indeed, take
$\mu(A) = 1_{\mathcal{B}}$ for all infinite subsets of $\Omega$ and $\mu(A) = 0_{\mathcal{B}}$ for all
finite $A \subset \Omega$ including the empty set $\emptyset$. Then
$1_{\mathcal{B}} = \mu(A) \neq \bigvee_{\omega \in A} \mu(\{\omega\}) = 0_{\mathcal{B}}$ for each infinite
$A \subset \Omega$ (the supposed completeness of the Boolean algebra $\mathcal{B}$ assures that
the supremum $\bigvee_{\omega \in A} \mu(\{\omega\})$ exists, i.e., is defined, but not that
it equals $\mu(A)$).

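Theorem 4.1. Let $\Omega$ be a nonempty set, let
$\mathcal{B} = \langle B, \vee_{\mathcal{B}}, \wedge_{\mathcal{B}}, \neg_{\mathcal{B}} \rangle$ be a complete Boolean algebra, let
$\mu : \mathcal{P}(\Omega) \to B$ be a distributive $\mathcal{B}$-valued possibilistic measure on
$\Omega$, let $f : \Omega \to B$ be a mapping (function). Then the relation

$\oint f \, d\mu = \bigvee_{\mathcal{B},\, \omega \in \Omega} \left[ f(\omega) \wedge_{\mathcal{B}} \mu(\{\omega\}) \right]$ (11)

holds. The index $\mathcal{B}$ is purposely made explicit here in order to pick out the
differences between Theorems 2.2 and 4.1. □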

Proof. The proof copies that of Theorem 2.2 for the real-valued case above, but
now with a careful and detailed verification that all the steps also go through
for supremum and infimum operations in Boolean algebras. Having performed
such a verification we can see that the proof of Theorem 2.2 goes through with
operations and relations over the unit interval of reals replaced by those
defined in the Boolean algebra $\mathcal{B}$. □
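
For a concrete feel of the Boolean-valued integral, the sketch below takes the Boolean algebra of all subsets of a small set S (join is union, meet is intersection, the ordering is inclusion) and compares formula (10) with formula (11). Everything here is an assumption made for illustration: the set `S`, the two-point space, the values in `point_values` and `f`, and all helper names.

```python
from itertools import combinations

def powerset(s):
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def union(sets):
    out = frozenset()
    for s in sets:
        out |= s
    return out

def measure(point_values, A):
    """Distributive measure: join of the values on the singletons of A."""
    return union(point_values[w] for w in A)

def sugeno_by_definition(f, point_values, S):
    """Formula (10): join over every alpha in the algebra P(S)."""
    return union(
        alpha & measure(point_values, [w for w in f if f[w] >= alpha])
        for alpha in powerset(S)
    )

def sugeno_by_theorem(f, point_values):
    """Formula (11): join over points of f(w) ∧ mu({w})."""
    return union(f[w] & point_values[w] for w in f)

S = frozenset({1, 2, 3})
# values mu({w}); their join is S, so mu(Omega) is the unit element
point_values = {"w1": frozenset({1, 2}), "w2": frozenset({3})}
f = {"w1": frozenset({2, 3}), "w2": frozenset({1, 3})}
print(sorted(sugeno_by_definition(f, point_values, S)),
      sorted(sugeno_by_theorem(f, point_values)))
```

On `frozenset` values the comparison `f[w] >= alpha` is the superset test, which plays the role of the ordering of the Boolean algebra.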

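In what follows, we shall consider, in more detail, $\mathcal{B}^0_{n \times n}$-valued
possibilistic measures and Sugeno integrals, where $\mathcal{B}^0_{n \times n}$ is the Boolean
algebra of all subsets of the Cartesian product $[n] \times [n]$ with respect to the
usual set-theoretic operations. We shall describe the subsets of
$[n] \times [n]$, i.e., the subsets of
$\{(i,j) : i, j \in \mathcal{N}^+, 1 \leq i, j \leq n\}$, as binary matrices, in order to
facilitate our definitions of particular classes of such matrices. So,
$\mathcal{B}^0_{n \times n} \cong \langle B_{n \times n}, \vee^*, \wedge^*, 1^{n \times n} - \cdot \rangle$, where

(i) $B_{n \times n} = \{(x_{ij})_{i,j=1}^{n} : x_{ij} \in \{0,1\}\}$,

(ii) $(x_{ij})_{i,j=1}^{n} \vee^* (y_{ij})_{i,j=1}^{n} = (x_{ij} \vee y_{ij})_{i,j=1}^{n}$,

(iii) $(x_{ij})_{i,j=1}^{n} \wedge^* (y_{ij})_{i,j=1}^{n} = (x_{ij} \wedge y_{ij})_{i,j=1}^{n}$,

(iv) $1^{n \times n} - (x_{ij})_{i,j=1}^{n} = (1 - x_{ij})_{i,j=1}^{n}$,

here $x_{ij} \vee y_{ij}$ ($x_{ij} \wedge y_{ij}$, resp.) is the standard supremum (infimum)
in $\{0,1\}$, so that $x_{ij} \vee y_{ij} = 1$ iff $x_{ij} = 1$ or $y_{ij} = 1$, and
$x_{ij} \wedge y_{ij} = 1$ iff $x_{ij} = y_{ij} = 1$.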

A matrix $X \in B_{n \times n}$ is called horizontal, if each of its rows is either $0^n$
($= (0, 0, \dots, 0) \in B_n = \{0,1\}^n$) or $1^n$ ($= (1, 1, \dots, 1)$). $X$ is called
vertical, if each of its columns is either $0^n$ or $1^n$. So, denoting by
$B_{n \times n,h}$ ($B_{n \times n,v}$, resp.) the set of all horizontal (vertical, resp.)
matrices from $B_{n \times n}$, we obtain that

$(x_{ij})_{i,j=1}^{n} \in B_{n \times n,h}$ iff, for all $1 \leq i, j \leq n$, $x_{ij} = x_{i1}$, (12)

$(x_{ij})_{i,j=1}^{n} \in B_{n \times n,v}$ iff, for all $1 \leq i, j \leq n$, $x_{ij} = x_{1j}$.

Obviously, the zero matrix $0^{n \times n} \in B_{n \times n}$, containing only occurrences of 0,
and the unit matrix $1^{n \times n} \in B_{n \times n}$, containing only occurrences of 1, are
simultaneously horizontal and vertical, and these are the only two matrices
possessing this property, so that
$B_{n \times n,h} \cap B_{n \times n,v} = \{0^{n \times n}, 1^{n \times n}\}$. Every horizontal matrix
$X = (x_{ij})_{i,j=1}^{n}$ is uniquely defined by the vector
$(x_{11}, x_{21}, \dots, x_{n1}) \in B_n$, denoted by $r(X)$, every vertical matrix
$X = (x_{ij})_{i,j=1}^{n}$ is uniquely defined by the vector
$(x_{11}, x_{12}, \dots, x_{1n}) \in B_n$, denoted by $c(X)$, both the mappings
$r : B_{n \times n,h} \leftrightarrow B_n$ and $c : B_{n \times n,v} \leftrightarrow B_n$ being 1-1. A matrix
$X \in B_{n \times n,h}$ is denoted by $X(n,h,\alpha)$, $\alpha \in D_n \subset I$, iff

$\varphi(r(X)) = \varphi((x_{11}, x_{21}, \dots, x_{n1})) = (1 - 2^{-n})^{-1} \sum_{i=1}^{n} x_{i1} 2^{-i} = \alpha$, (13)

a matrix $X \in B_{n \times n,v}$ is denoted by $X(n,v,\alpha)$, $\alpha \in D_n \subset I$, iff

$\varphi(c(X)) = \varphi((x_{11}, x_{12}, \dots, x_{1n})) = (1 - 2^{-n})^{-1} \sum_{j=1}^{n} x_{1j} 2^{-j} = \alpha$. (14)

As can be easily seen, for each $\alpha \in D_n$ both the matrices $X(n,h,\alpha)$ and
$X(n,v,\alpha)$ are defined uniquely.

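A matrix $X \in B_{n \times n}$ is called product, if there are $Y \in B_{n \times n,h}$ and
$Z \in B_{n \times n,v}$ such that $X = Y \wedge^* Z$. The subset of all product matrices in
$B_{n \times n}$ is denoted by $B_{n \times n,p}$. The inclusions
$B_{n \times n,h} \subset B_{n \times n,p}$ and $B_{n \times n,v} \subset B_{n \times n,p}$ are obvious, as
$X \wedge^* 1^{n \times n} = X$ for each $X \in B_{n \times n,h}$ and $1^{n \times n} \wedge^* X = X$ for
each $X \in B_{n \times n,v}$. On the other hand, $B_{n \times n,p} \neq B_{n \times n}$ (i.e., not
every $n \times n$ binary matrix is product), at least if $n > 2$ holds. Indeed, given
$Y \in B_{n \times n,h}$ and $Z \in B_{n \times n,v}$, their product $Y \wedge^* Z \in B_{n \times n,p}$ is
uniquely determined, and each such $Y$ ($Z$, resp.) is uniquely determined by
$r(Y) \in \{0,1\}^n$ (by $c(Z) \in \{0,1\}^n$, resp.). Hence, the inequality

$\mathrm{card}(B_{n \times n,p}) \leq \mathrm{card}(B_{n \times n,h}) \cdot \mathrm{card}(B_{n \times n,v}) =$ (15)

$= (\mathrm{card}(\{0,1\}^n))^2 = (2^n)^2 = 2^{2n}$

holds, but

$\mathrm{card}(B_{n \times n}) = \mathrm{card}(\mathcal{P}([n] \times [n])) = 2^{n^2}$. (16)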

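Let us define the mapping $\varphi^{(2)} : \mathcal{P}([n] \times [n]) \to I = \langle 0,1 \rangle$, which
ascribes to each set $X$ of pairs of positive integers not greater than $n$
(identified with the matrix $(x_{ij})_{i,j=1}^{n} \in B_{n \times n}$) the real number
$\varphi^{(2)}(X)$ as follows

$\varphi^{(2)}(X) = (1 - 2^{-n})^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} x_{ij} 2^{-(i+j)}$. (17)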

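Theorem 4.2.

(i) Let $X = (x_{ij})_{i,j=1}^{n}$, $Y = (y_{ij})_{i,j=1}^{n}$ be disjoint in the sense that
the inequality $x_{ij} + y_{ij} \leq 1$ holds for each $1 \leq i, j \leq n$. Then
$\varphi^{(2)}(X \vee^* Y) = \varphi^{(2)}(X) + \varphi^{(2)}(Y)$.

(ii) For all $\alpha \in D_n \subset I$,
$\varphi^{(2)}(X(n,h,\alpha)) = \varphi^{(2)}(X(n,v,\alpha)) = \alpha$, in particular,
$\varphi^{(2)}(0^{n \times n}) = 0$, $\varphi^{(2)}(1^{n \times n}) = 1$.

(iii) $\varphi^{(2)}(X(n,h,\alpha) \wedge^* X(n,v,\beta)) = \alpha\beta$ (the standard product in
$I$) for all $\alpha, \beta \in D_n$. In other terms,
$\varphi^{(2)}(X \wedge^* Y) = \varphi(r(X)) \cdot \varphi(c(Y))$ for all $X \in B_{n \times n,h}$,
$Y \in B_{n \times n,v}$.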




Boolean-Like Interpretation of Sugeno Integral 253 



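Proof. (i) Denote, for each $X = (x_{ij})_{i,j=1}^{n}$, by $\lambda(X)$ the corresponding
subset of $[n] \times [n]$, hence, $\lambda(X) = \{(i,j) : 1 \leq i, j \leq n, x_{ij} = 1\}$. If
$x_{ij} + y_{ij} \leq 1$ holds for each $1 \leq i, j \leq n$, then the sets $\lambda(X)$ and
$\lambda(Y)$ are disjoint, moreover, $\lambda(X \vee^* Y) = \lambda(X) \cup \lambda(Y)$ holds in
general. So, denoting by $K_n$ the value $(1 - 2^{-n})^{-1}$, we obtain that

$\varphi^{(2)}(X \vee^* Y) = K_n^2 \sum_{i=1}^{n} \sum_{j=1}^{n} (x_{ij} \vee y_{ij}) 2^{-(i+j)} =$ (18)

$= K_n^2 \sum_{(i,j) \in \lambda(X \vee^* Y)} 2^{-(i+j)} = K_n^2 \sum_{(i,j) \in \lambda(X) \cup \lambda(Y)} 2^{-(i+j)} =$

$= K_n^2 \sum_{(i,j) \in \lambda(X)} 2^{-(i+j)} + K_n^2 \sum_{(i,j) \in \lambda(Y)} 2^{-(i+j)} = \varphi^{(2)}(X) + \varphi^{(2)}(Y)$,

and (i) is proved.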

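(ii) Let $\alpha \in D_n$, let $X(n,h,\alpha) = (x_{ij})_{i,j=1}^{n}$. Then

$\varphi^{(2)}(X(n,h,\alpha)) = (1 - 2^{-n})^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} x_{ij} 2^{-(i+j)} =$ (19)

$= (1 - 2^{-n})^{-1} \sum_{i=1}^{n} \left[ (1 - 2^{-n})^{-1} \sum_{j=1}^{n} x_{ij} 2^{-j} \right] 2^{-i}$.

As $X(n,h,\alpha)$ is horizontal, $(1 - 2^{-n})^{-1} \sum_{j=1}^{n} x_{ij} 2^{-j} = 1$, if
$x_{i1}$ ($= x_{i2} = \dots = x_{in}$) $= 1$, this sum being zero, if $x_{i1} = 0$. Hence,
$\varphi^{(2)}(X(n,h,\alpha)) = (1 - 2^{-n})^{-1} \sum_{i=1}^{n} x_{i1} 2^{-i}$, but this
expression equals $\alpha$ just according to the definition of $X(n,h,\alpha)$. For
$X(n,v,\alpha)$ the proof is quite analogous and for the particular cases of
$0^{n \times n}$ and $1^{n \times n}$ the equalities also easily follow.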

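(iii) First of all, let us consider the particular case
$X(n,h,\alpha_{i_0}) = (x_{ij})_{i,j=1}^{n}$ such that
$\alpha_{i_0} = (1 - 2^{-n})^{-1} 2^{-i_0}$ for some $1 \leq i_0 \leq n$. Consequently,
$r(X(n,h,\alpha_{i_0})) = (0, \dots, 0, 1, 0, \dots, 0)$ with the only unit occupying the
$i_0$-th position, so that $x_{i_0 j} = 1$ for all $1 \leq j \leq n$, $x_{ij} = 0$ for all
$i \neq i_0$ and all $1 \leq j \leq n$. Let $X(n,v,\beta) = (y_{ij})_{i,j=1}^{n}$,
$\beta \in D_n$, be any vertical matrix. Then

$\varphi^{(2)}(X(n,h,\alpha_{i_0}) \wedge^* X(n,v,\beta)) = K_n^2 \sum_{i=1}^{n} \sum_{j=1}^{n} (x_{ij} \wedge y_{ij}) 2^{-(i+j)} =$ (20)

$= (1 - 2^{-n})^{-2} \sum_{j=1}^{n} (x_{i_0 j} \wedge y_{i_0 j}) 2^{-(i_0 + j)} =$

$= (1 - 2^{-n})^{-1} 2^{-i_0} \left[ (1 - 2^{-n})^{-1} \sum_{j=1}^{n} y_{i_0 j} 2^{-j} \right] =$

$= \left[ (1 - 2^{-n})^{-1} \sum_{i=1}^{n} x_{i1} 2^{-i} \right] \left[ (1 - 2^{-n})^{-1} \sum_{j=1}^{n} y_{1j} 2^{-j} \right] =$

$= \varphi(r(X(n,h,\alpha_{i_0}))) \, \varphi(c(X(n,v,\beta))) = \alpha_{i_0} \cdot \beta$.

Let $X(n,h,\alpha) = (x_{ij})_{i,j=1}^{n}$ be such that $\alpha \in D_n$. Then

$X(n,h,\alpha) = \bigvee^* \{X(n,h,\alpha_k) : x_{k1} = 1\}$ (21)

and $X(n,h,\alpha_k)$ are mutually disjoint for different $k$'s. So, given
$\beta \in D_n$, matrices $X(n,h,\alpha_k) \wedge^* X(n,v,\beta)$ are also mutually disjoint
for different $k$'s and

$\bigvee^*_{k : x_{k1} = 1} \left[ X(n,h,\alpha_k) \wedge^* X(n,v,\beta) \right] = \left[ \bigvee^*_{k : x_{k1} = 1} X(n,h,\alpha_k) \right] \wedge^* X(n,v,\beta) =$ (22)

$= X(n,h,\alpha) \wedge^* X(n,v,\beta)$.

Hence, due to the already proved part (i) of this Theorem,

$\varphi^{(2)}(X(n,h,\alpha) \wedge^* X(n,v,\beta)) =$ (23)

$= \sum_{k : x_{k1} = 1} \varphi^{(2)}(X(n,h,\alpha_k) \wedge^* X(n,v,\beta)) =$

$= \sum_{1 \leq k \leq n,\, x_{k1} = 1} \left[ (1 - 2^{-n})^{-1} 2^{-k} \beta \right] =$

$= \left[ (1 - 2^{-n})^{-1} \sum_{i=1}^{n} x_{i1} 2^{-i} \right] \beta = \alpha\beta$.

The theorem is proved. □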
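
The three assertions of Theorem 4.2 can be verified exhaustively for a small n. The sketch below does this for n = 3 with exact rationals; matrices are plain nested lists with 0-based indices, and the function names (`horizontal`, `vertical`, `meet`, `join`, `phi2`) are illustrative choices, not notation from the paper.

```python
from fractions import Fraction as Fr
from itertools import product

def phi(x):
    """Mapping (7) on a binary tuple."""
    n = len(x)
    return sum(Fr(xi, 2 ** i) for i, xi in enumerate(x, 1)) / (1 - Fr(1, 2 ** n))

def horizontal(r):
    """Horizontal matrix whose i-th row is constantly r[i]."""
    return [[ri] * len(r) for ri in r]

def vertical(c):
    """Vertical matrix whose j-th column is constantly c[j]."""
    return [list(c) for _ in c]

def meet(X, Y):
    return [[a & b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def join(X, Y):
    return [[a | b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def phi2(X):
    """Mapping (17); entry X[i][j] stands for the pair (i + 1, j + 1)."""
    n = len(X)
    k = 1 / (1 - Fr(1, 2 ** n))
    return k * k * sum(Fr(X[i][j], 2 ** (i + j + 2))
                       for i in range(n) for j in range(n))

tuples = list(product((0, 1), repeat=3))
# (ii): horizontal and vertical encodings keep the encoded value
part_ii = all(phi2(horizontal(x)) == phi(x) == phi2(vertical(x)) for x in tuples)
# (iii): the meet of a horizontal and a vertical matrix encodes the product
part_iii = all(phi2(meet(horizontal(r), vertical(c))) == phi(r) * phi(c)
               for r in tuples for c in tuples)
# (i): additivity on one pair of disjoint matrices
X, Y = horizontal((1, 0, 0)), horizontal((0, 0, 1))
part_i = phi2(join(X, Y)) == phi2(X) + phi2(Y)
print(part_i, part_ii, part_iii)
```

Geometrically, the meet of a horizontal and a vertical matrix marks a "rectangle" of ones, and `phi2` measures its area in the rescaled binary weights.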



The binary $n \times n$ matrices introduced and investigated above will play the
role of elementary building blocks in the following construction.

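Definition 4.3. Let $N \in \mathcal{N}^+$ be a positive integer. Binary diagonal hypermatrix
of order $(N,n)$ is an $N \times N$-matrix $W = (b_{k\ell})$, $k, \ell = 1, \dots, N$ such that
$b_{k\ell} \in B_{n \times n}$ for each $k, \ell = 1, 2, \dots, N$, i.e., each $b_{k\ell}$ is a binary
$n \times n$ matrix, and $b_{k\ell} = 0^{n \times n}$ for all $1 \leq k \neq \ell \leq N$. The space of all
binary diagonal hypermatrices of order $(N,n)$ will be denoted by
$\mathcal{H}_{N,n}$. The mapping $\varphi^* : \mathcal{H}_{N,n} \to \langle 0, \infty)$ is defined by
$\varphi^*((b_{k\ell})_{k,\ell=1}^{N}) = \sum_{k=1}^{N} \sum_{\ell=1}^{N} \varphi^{(2)}(b_{k\ell})$
($= \sum_{i=1}^{N} \varphi^{(2)}(b_{ii})$, as can be easily seen), where $\varphi^{(2)}$ is the
mapping taking $B_{n \times n}$ into $\langle 0,1 \rangle$ and defined above.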

Finally, we have at hand everything necessary to state and prove the asser- 
tion showing that there is a close connection between Boolean-valued Sugeno 
integrals and the classical integrals. Namely, given a function that takes a prob- 
ability space into the unit interval of real numbers, there exists a Boolean-valued 
interpretation of the notions under consideration such that the value of the clas- 
sical integral of the given function can be obtained as the value of the above 
defined projection of the corresponding Boolean-valued Sugeno integral into the 
unit interval of real numbers. 



Theorem 4.3. Let $\Omega = \{\omega_1, \omega_2, \dots, \omega_N\}$ be a finite nonempty space, let
$n \in \mathcal{N}^+$ be fixed, let $p : \Omega \to D_n$ be a probability distribution on
$\Omega$, i.e., $\sum_{i=1}^{N} p(\omega_i) = 1$, let $f : \Omega \to D_n$ be a function. For each
$1 \leq i \leq N$, let $W_{p,i} = (b_{k\ell})_{k,\ell=1}^{N} \in \mathcal{H}_{N,n}$ be such that
$b_{ii} = X(n,h,p(\omega_i))$, $b_{k\ell} = 0^{n \times n}$ for every $(k,\ell) \neq (i,i)$, and let
$W_{f,i} = (d_{k\ell})_{k,\ell=1}^{N} \in \mathcal{H}_{N,n}$ be such that
$d_{ii} = X(n,v,f(\omega_i))$, $d_{k\ell} = 0^{n \times n}$ for every $(k,\ell) \neq (i,i)$. Then

$\varphi^* \left( \bigvee^*_{i=1,\dots,N} (W_{p,i} \wedge^* W_{f,i}) \right) = \sum_{i=1}^{N} p(\omega_i) f(\omega_i)$. (24)

Proof. An easy calculation yields that

$\varphi^* \left( \bigvee^*_{i=1,\dots,N} (W_{p,i} \wedge^* W_{f,i}) \right) = \sum_{i=1}^{N} \varphi^*(W_{p,i} \wedge^* W_{f,i}) =$ (25)

$= \sum_{i=1}^{N} \varphi^{(2)}(X(n,h,p(\omega_i)) \wedge^* X(n,v,f(\omega_i))) =$

$= \sum_{i=1}^{N} p(\omega_i) f(\omega_i)$,

applying Theorem 4.2 (iii) and recalling the obvious fact that the hypermatrices
$W_{p,i} \wedge^* W_{f,i}$ are disjoint for different $i$'s. □

Obviously, the expression $\bigvee^*_{i=1,\dots,N} (W_{p,i} \wedge^* W_{f,i})$ can be taken as
the value of the Boolean-valued Sugeno integral of the function
$f : \Omega \to I$ (with the values $f(\omega_i)$ encoded by the matrices $W_{f,i}$) with
respect to the possibilistic measure $\mu$ on finite
$\Omega = \{\omega_1, \omega_2, \dots, \omega_N\}$ (the particular values $\mu(\{\omega_i\})$ being
defined by the matrices $W_{p,i}$). This Boolean-valued Sugeno integral takes its
values in the Boolean algebra defined over $\mathcal{H}_{N,n}$ by the operations
extending $\vee^*$, $\wedge^*$, and $1^{n \times n} - \cdot$ in the natural way to
$\mathcal{H}_{N,n}$. Applying the mapping $\varphi^*$ to this value from $\mathcal{H}_{N,n}$, the
value $\sum_{i=1}^{N} p(\omega_i) f(\omega_i)$, i.e., that of the classical integral of the
function $f$ with respect to the probability distribution $p$ on $\Omega$ is
obtained.
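
The construction of Theorem 4.3 can be followed step by step in code. The sketch below is an illustration under several assumptions: n = 3 (so that D_n consists of the multiples of 1/7), a hypothetical three-point space with made-up values of `p` and `f`, and a simplified representation in which a diagonal hypermatrix is stored as the list of its diagonal blocks, the off-diagonal zero blocks being left implicit.

```python
from fractions import Fraction as Fr

def decode(alpha, n):
    """Inverse of mapping (7): the binary n-tuple x with phi(x) = alpha."""
    k = alpha * (2 ** n - 1)
    if k.denominator != 1:
        raise ValueError("alpha is not in D_n")
    return tuple((int(k) >> (n - i)) & 1 for i in range(1, n + 1))

def horizontal(r):
    return [[ri] * len(r) for ri in r]

def vertical(c):
    return [list(c) for _ in c]

def zero(n):
    return [[0] * n for _ in range(n)]

def meet(X, Y):
    return [[a & b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def join(X, Y):
    return [[a | b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def phi2(X):
    n = len(X)
    k = 1 / (1 - Fr(1, 2 ** n))
    return k * k * sum(Fr(X[i][j], 2 ** (i + j + 2))
                       for i in range(n) for j in range(n))

def phi_star(W):
    """phi* on a hypermatrix given by its list of diagonal blocks."""
    return sum(phi2(block) for block in W)

def integral_as_hypermatrix(p, f, n):
    """Join over i of W_{p,i} meet W_{f,i}, computed block by block."""
    N = len(p)
    total = [zero(n) for _ in range(N)]
    for i in range(N):
        W_p = [horizontal(decode(p[i], n)) if k == i else zero(n) for k in range(N)]
        W_f = [vertical(decode(f[i], n)) if k == i else zero(n) for k in range(N)]
        W_pf = [meet(a, b) for a, b in zip(W_p, W_f)]
        total = [join(a, b) for a, b in zip(total, W_pf)]
    return total

n = 3
p = [Fr(1, 7), Fr(2, 7), Fr(4, 7)]   # probabilities in D_3, summing to 1
f = [Fr(3, 7), Fr(5, 7), Fr(1)]      # function values in D_3
W = integral_as_hypermatrix(p, f, n)
expected_value = sum(pi * fi for pi, fi in zip(p, f))
print(phi_star(W), expected_value)
```

The value of `phi_star(W)` agrees with the ordinary expected value of `f` under `p`, which is the content of equation (24).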

The items [5] and [2] below introduce the notions of the possibilistic measure 
and possibility theory, [1] presents a detailed explanation of the present state 
of this domain. Besides [4], also [3] can be used as an introduction dealing with 
Boolean algebras and related notions. 
Abstract. Given several Dempster-Shafer belief functions, the framework
of valuation networks describes an efficient method for computing
the marginal of the combined belief function. The computation is based 
on a message passing scheme in a Markov tree where after the selection of 
a root node an inward and an outward propagation can be distinguished. 
In this paper it will be shown that outward propagation can be replaced 
by another partial inward propagation. In addition it will also be shown 
how the efficiency of inward propagation can be improved. 



1 Introduction 

Dempster-Shafer belief function theory (see [3] and [10]) can be used to represent
uncertain knowledge. A piece of evidence is encoded by a belief function. Given
many pieces of evidence represented by $\vartheta_1, \dots, \vartheta_m$ and one (or many)
hypotheses $H$, the problem to solve is to marginalize the combined belief
function $\vartheta_1 \otimes \dots \otimes \vartheta_m$ in an efficient way to the set of variables
of the hypothesis $H$. The framework of valuation networks [12] turns out to be
useful for solving this problem because it allows the belief functions to be
distributed on a Markov tree. The computation is then based on a message
passing scheme where the two main operations, combination and marginalization,
are always performed locally on relatively small domains. Depending on the
messages which are sent between neighboring nodes of the Markov tree, at least
four different architectures can be distinguished:

• the Shenoy-Shafer architecture (see [14]), 

• the Lauritzen-Spiegelhalter architecture (see [7] and [8]), 

• the HUGIN architecture (see [4] and [5]), 

• the Fast-Division architecture (see [2]). 

In [16] a comparison of the former three architectures is given for
propagating probability functions. The most popular architecture for propagating
belief functions is the Shenoy-Shafer architecture. The Lauritzen-Spiegelhalter
architecture and especially the HUGIN architecture are more interesting in the
field of Bayesian Networks [9]. Finally, the Fast-Division architecture is an
attempt to apply ideas contained in the HUGIN architecture and the
Lauritzen-Spiegelhalter architecture for propagating belief functions
efficiently. In this paper the Shenoy-Shafer architecture will be used. However,
the ideas presented here do not depend on the architecture.

A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 256-267, 1999.

For each of these architectures an inward propagation phase and an outward 
propagation phase can be distinguished. The main contribution of this paper is 
to show that it is always possible to replace outward propagation by another 
partial inward propagation. 

Furthermore, since only inward propagation is required, it is important to 
speed up inward propagation. The second contribution of this paper is to de- 
velop a heuristic to improve the performance of the inward propagation phase. 
Intermediate results are stored during the first inward propagation phase such 
that the number of combinations needed is minimized. 

The paper is divided as follows: in Section 2, multivariate Dempster-Shafer 
belief functions are introduced. Section 3 describes the Shenoy-Shafer architec- 
ture. In Section 4, it is shown that instead of an inward propagation followed 
by an outward propagation, it is sufficient to perform an inward propagation 
followed by another partial inward propagation. Finally, some improvements to 
speed up inward propagation are presented in Section 5. 



2 Dempster-Shafer Belief Functions 

In this section, some basic notions of multivariate Dempster-Shafer belief func- 
tions are recalled. More information on the Dempster-Shafer theory of evidence 
is given in [3], [10], and [15]. 

Variables and Configurations. We define $\Theta_x$ as the state space of a
variable $x$, i.e. the set of values of $x$. It is assumed that all variables have
finite state spaces. Upper-case italic letters such as $D, E, F, \dots$ denote sets
of variables. Given a set $D$ of variables, let $\Theta_D$ denote the Cartesian
product $\Theta_D = \times \{\Theta_x : x \in D\}$. $\Theta_D$ is called state space for $D$. The
elements of $\Theta_D$ are configurations of $D$. Upper-case italic letters from the
beginning of the alphabet such as $A, B, \dots$ are used to denote sets of
configurations.

Projection of Sets of Configurations. If $D$ and $D'$ are sets of variables,
$D' \subseteq D$ and $x$ is a configuration of $D$, then $x^{\downarrow D'}$ denotes the projection
of $x$ to $D'$. If $A$ is a subset of $\Theta_D$, then the projection of $A$ to $D'$,
denoted as $A^{\downarrow D'}$, is obtained by projecting each element of $A$ to $D'$, i.e.
$A^{\downarrow D'} = \{x^{\downarrow D'} : x \in A\}$.

Extension of Sets of Configurations. If $D$ and $D'$ are sets of variables,
$D' \subseteq D$, and $B$ is a subset of $\Theta_{D'}$, then $B^{\uparrow D}$ denotes the extension of
$B$ to $D$, i.e. $B^{\uparrow D} = B \times \Theta_{D \setminus D'}$.



2.1 Different Representations 

A piece of evidence can be encoded by a belief function. Similar to complex
numbers where $c \in \mathbb{C}$ can be represented in polar or rectangular form, there are
also different ways to represent the information contained in a belief function. It
can be represented as a mass function, as a belief function, or as a commonality
function. In this paper commonality functions are not considered. We therefore
focus on (unnormalized) mass functions and belief functions. The unusual
notation $[\varphi]_m$ and $[\varphi]_b$ is used instead of $m_\varphi$ and $bel_\varphi$, because in
our opinion it is more appropriate and convenient. One of the advantages of this
notation is its ability to distinguish easily between normalized and
unnormalized mass or belief functions. In accordance with Shafer [11] we speak
of potentials when no representation is specified. We then write only
$\varphi$, thus without enclosing brackets.

Mass Function. A mass function $[\varphi]_m$ on $D$ assigns to every subset $A$ of
$\Theta_D$ a value in $[0,1]$, that is $[\varphi]_m : 2^{\Theta_D} \to [0,1]$. The following
condition must be satisfied:

$\sum_{A \subseteq \Theta_D} [\varphi(A)]_m = 1$. (1)

Intuitively, $[\varphi(A)]_m$ is the weight of evidence for $A$ that has not already
been assigned to some proper subset of $A$. Sometimes, a second condition,
$[\varphi(\emptyset)]_m = 0$, is imposed. A mass function for which this additional
condition holds is called normalized, otherwise it is called unnormalized.

Belief Function. A belief function $[\varphi]_b$ on $D$,
$[\varphi]_b : 2^{\Theta_D} \to [0,1]$, can be obtained in terms of a mass function:

$[\varphi(A)]_b = \sum_{B : B \subseteq A} [\varphi(B)]_m$. (2)

Again, if $[\varphi(\emptyset)]_b = 0$, then the belief function is called normalized. Note
that a normalized mass function always leads to a normalized belief function
and vice versa.



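An unnormalized mass or belief function can always be normalized. We write
$[\varphi]_M$ and $[\varphi]_B$ when $\varphi$ is normalized. The transformation
is given as follows:

$$[\varphi(A)]_M = \begin{cases} 0 & \text{if } A = \emptyset, \\ \dfrac{[\varphi(A)]_m}{1 - [\varphi(\emptyset)]_m} & \text{otherwise,} \end{cases} \tag{3}$$

$$[\varphi(A)]_B = \frac{[\varphi(A)]_b - [\varphi(\emptyset)]_b}{1 - [\varphi(\emptyset)]_b}. \tag{4}$$

Given a potential $\varphi$ on $D$, $D$ is called the domain of $\varphi$. The
sets $A \subseteq \Theta_D$ for which $[\varphi(A)]_m \neq 0$ are called focal
sets. We use $FS(\varphi)$ to denote the focal sets of $\varphi$.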
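The normalization step can be sketched as follows. This is only an illustration: the dictionary representation, the example values, and the names `normalize_mass`, `normalized_belief` and `focal_sets` are assumptions, not part of the original presentation.

```python
def belief(mass, subset):
    """Unnormalized belief: total mass of all sets contained in `subset`."""
    return sum(value for focal, value in mass.items() if focal <= subset)

def normalize_mass(mass):
    """Normalized mass: remove the empty set and rescale by 1 - m(empty set)."""
    conflict = mass.get(frozenset(), 0.0)
    if conflict >= 1.0:
        raise ValueError("a completely contradictory potential cannot be normalized")
    return {focal: value / (1.0 - conflict)
            for focal, value in mass.items() if focal and value != 0.0}

def normalized_belief(mass, subset):
    """Normalized belief, computed from the unnormalized belief function."""
    b_empty = belief(mass, frozenset())
    return (belief(mass, subset) - b_empty) / (1.0 - b_empty)

def focal_sets(mass):
    """All sets with non-zero mass."""
    return {focal for focal, value in mass.items() if value != 0.0}

mass = {frozenset(): 0.1, frozenset("a"): 0.3,
        frozenset("ab"): 0.4, frozenset("abc"): 0.2}
norm = normalize_mass(mass)
lhs = belief(norm, frozenset("ab"))             # belief of the normalized masses
rhs = normalized_belief(mass, frozenset("ab"))  # normalized belief taken directly
```

Both routes give the same value for the set {a, b}, which is the consistency one expects between the two normalization formulas.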
2.2 Basic Operations

The basic operations for potentials are combination, marginalization, and ex- 
tension. For each of these operations there are representations which are more 
appropriate than others. One advantage of the mass function representation is 
that every basic operation can be performed more or less easily. In this section, 
we will therefore focus on mass functions. 



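Combination. Suppose $\varphi_1$ and $\varphi_2$ are potentials on $D_1$ and
$D_2$, respectively. The combination of these two potentials produces an
unnormalized potential on $D = D_1 \cup D_2$:

$$[\varphi_1 \otimes \varphi_2(A)]_m = \sum_{B_i \subseteq \Theta_{D_i}} \{[\varphi_1(B_1)]_m \cdot [\varphi_2(B_2)]_m : B_1^{\uparrow D} \cap B_2^{\uparrow D} = A\}. \tag{5}$$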
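A minimal sketch of the combination rule, under the simplifying assumption that both potentials already live on the same domain (so no extension is needed before intersecting). The function name `combine` and the example values are hypothetical.

```python
from collections import defaultdict

def combine(m1, m2):
    """Unnormalized combination of two mass functions on the same frame.

    Every pair of focal sets contributes the product of its masses to the
    intersection of the two sets; conflicting pairs end up on the empty set.
    """
    result = defaultdict(float)
    for b1, v1 in m1.items():
        for b2, v2 in m2.items():
            result[b1 & b2] += v1 * v2
    return dict(result)

frame = frozenset("abc")
m1 = {frozenset("a"): 0.6, frame: 0.4}
m2 = {frozenset("b"): 0.5, frame: 0.5}
m12 = combine(m1, m2)
```

Here the pair ({a}, {b}) is contradictory, so its product 0.3 is assigned to the empty set and the result is unnormalized. The cost of the double loop is the product of the numbers of focal sets.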

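Marginalization. Suppose $\varphi$ is a potential on $D$ and $D' \subseteq D$.
The marginalization of $\varphi$ to $D'$ produces a potential on $D'$:

$$[\varphi^{\downarrow D'}(B)]_m = \sum_{A : A^{\downarrow D'} = B} [\varphi(A)]_m. \tag{6}$$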



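Extension. Suppose $\varphi$ is a potential on $D'$ and $D' \subseteq D$. The
extension of $\varphi$ to $D$ produces a potential on $D$:

$$[\varphi^{\uparrow D}(A)]_m = \begin{cases} [\varphi(B)]_m & \text{if } A = B^{\uparrow D}, \\ 0 & \text{otherwise.} \end{cases} \tag{7}$$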
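Marginalization and extension can be sketched on multi-variable domains as follows. The representation is an assumption made for illustration: a domain is a tuple of variable names, a configuration is a tuple of values, and a subset of the frame is a frozenset of configurations. The variables `x` and `y` and all function names are hypothetical.

```python
from itertools import product

FRAMES = {"x": (0, 1), "y": (0, 1)}   # hypothetical binary variables

def project(subset, domain, target):
    """Projection of a set of configurations to the variables in `target`."""
    idx = [domain.index(v) for v in target]
    return frozenset(tuple(cfg[i] for i in idx) for cfg in subset)

def extend(subset, domain, target):
    """Cylindrical extension of a set from `domain` to the larger `target`."""
    idx = [target.index(v) for v in domain]
    return frozenset(cfg for cfg in product(*(FRAMES[v] for v in target))
                     if tuple(cfg[i] for i in idx) in subset)

def marginalize(mass, domain, target):
    """Add up the masses of all focal sets that share the same projection."""
    out = {}
    for focal, value in mass.items():
        key = project(focal, domain, target)
        out[key] = out.get(key, 0.0) + value
    return out

def extend_mass(mass, domain, target):
    """Move every focal set to its cylindrical extension, keeping its mass."""
    return {extend(focal, domain, target): value for focal, value in mass.items()}

phi = {frozenset({(0, 0), (0, 1)}): 0.7,      # "x = 0"
       frozenset({(0, 0), (1, 1)}): 0.3}      # "x = y"
marg = marginalize(phi, ("x", "y"), ("x",))
back = extend_mass(marg, ("x",), ("x", "y"))
```

Marginalizing and extending again does not restore the original potential: the focal set expressing "x = y" becomes the whole frame, so marginalization is lossy in general.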



3 The Shenoy-Shafer Architecture 

Initially, several potentials $\vartheta_1, \ldots, \vartheta_m$ are given. Their
domains form a hypergraph. For this hypergraph a covering hypertree [6] has to
be computed first. Then a Markov tree [1] can be constructed where
$\vartheta_1, \ldots, \vartheta_m$ are distributed on the nodes. In such a way
every node $N_i$, $1 \le i \le \ell$, contains a potential $\varphi_i$ on the
domain $D_i$. Each $\varphi_i$ is equal to the combination of some $\vartheta_j$,
and $\vartheta_1 \otimes \cdots \otimes \vartheta_m$ is always equal to
$\varphi_1 \otimes \cdots \otimes \varphi_\ell$.

A Markov tree is the underlying computational structure of the Shenoy- 
Shafer architecture. Computation is then organized as a message passing scheme 
where nodes receive and send messages. 



3.1 Local Computations 

The Dempster-Shafer theory fits perfectly into the framework of valuation
networks [12]. This framework makes it possible to distribute potentials on a
Markov tree and to use the Markov tree afterwards as the underlying
computational structure.

The framework of valuation networks involves the two operations marginal- 
ization and combination, and three axioms which enable local computation. Po- 
tentials satisfy these axioms: 
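Axiom A1 (Transitivity of marginalization): Suppose $\varphi$ is a potential on
$D$, and suppose $F \subseteq E \subseteq D$. Then

$$\varphi^{\downarrow F} = (\varphi^{\downarrow E})^{\downarrow F}. \tag{8}$$

Axiom A2 (Commutativity and associativity): Suppose $\varphi_1$, $\varphi_2$,
and $\varphi_3$ are potentials on $D_1$, $D_2$, and $D_3$, respectively. Then

$$\varphi_1 \otimes \varphi_2 = \varphi_2 \otimes \varphi_1, \tag{9}$$

$$\varphi_1 \otimes (\varphi_2 \otimes \varphi_3) = (\varphi_1 \otimes \varphi_2) \otimes \varphi_3. \tag{10}$$

Axiom A3 (Distributivity of marginalization over combination): Suppose
$\varphi_1$ and $\varphi_2$ are potentials on $D_1$ and $D_2$, respectively. Then

$$(\varphi_1 \otimes \varphi_2)^{\downarrow D_1} = \varphi_1 \otimes \varphi_2^{\downarrow D_1 \cap D_2}. \tag{11}$$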

Note that "=" above really denotes equality, not merely an equivalence
relation.

3.2 Message Passing Scheme 

Given a Markov tree, computation can be organized as a message passing scheme
where nodes receive messages from and send messages to neighbor nodes. A node
can send a message to a neighbor node as soon as it has received a message from
every other neighbor node. In such a way, a propagation can be started where the
leaf nodes compute and send messages first. At the end of the propagation, every
node will have received a message from and sent a message to each of its
neighbor nodes.
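The ordering constraint on messages can be made concrete with a small scheduling sketch. It assumes the tree is given as an adjacency dictionary and that one node is chosen as the end point of the first sweep; the function name `inward_schedule` and the example tree are hypothetical.

```python
def inward_schedule(neighbors, root):
    """Order the messages (sender, receiver) that flow towards `root`.

    A node sends to its neighbor on the path to `root` only after all of its
    other neighbors have sent to it, so leaves come first (post-order).
    """
    order = []

    def visit(node, parent):
        for child in neighbors[node]:
            if child != parent:
                visit(child, node)
        if parent is not None:
            order.append((node, parent))

    visit(root, None)
    return order

# Hypothetical Markov tree: 1 - 2, 1 - 3, 2 - 4, 2 - 5
tree = {1: [2, 3], 2: [1, 4, 5], 3: [1], 4: [2], 5: [2]}
inward = inward_schedule(tree, root=1)
# The second sweep reverses the first one and flips each message.
outward = [(receiver, sender) for sender, receiver in reversed(inward)]
```

With both sweeps, every edge of the tree carries exactly two messages, one in each direction.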

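If a node $N_{k_0}$ has neighbor nodes $N_{k_1}, \ldots, N_{k_n}$, then the
message from $N_{k_0}$ to a neighbor node $N_{k_i}$ is computed as

$$\varphi_{k_0 k_i} = \Big(\varphi_{k_0} \otimes \big(\otimes\{\varphi_{k_j k_0} : 1 \le j \le n,\ j \neq i\}\big)\Big)^{\downarrow D_{k_0} \cap D_{k_i}}. \tag{12}$$

As soon as $N_{k_0}$ has received a message from each of its neighbor nodes, it
can compute a potential $\varphi'_{k_0}$ which is equal to the global potential
$\varphi := \varphi_1 \otimes \cdots \otimes \varphi_\ell$ marginalized to the
domain $D_{k_0}$:

$$\varphi'_{k_0} = \varphi_{k_0} \otimes \big(\otimes\{\varphi_{k_j k_0} : 1 \le j \le n\}\big). \tag{13}$$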

Although the Shenoy-Shafer architecture does not require a root node to be
designated, we will nevertheless designate a node $N_r$ as root node. In
practice, every node could be selected, but it is better to select a node on
which some of the queries can be answered. In such a way an inward propagation
phase towards the root node and an outward propagation phase can be
distinguished. After inward propagation, the root node is able to compute
$\varphi'_r = \varphi^{\downarrow D_r}$. If an outward propagation is then
performed, every other node $N_{k_0}$ can compute
$\varphi'_{k_0} = \varphi^{\downarrow D_{k_0}}$. Often, only a partial outward
propagation depending on the query is performed. For a set
$H \subseteq \Theta_{D_h}$ with $D_h \subseteq D_i$ for a node $N_i$ we have

$$[\varphi^{\downarrow D_h}(H)]_b = [\varphi^{\downarrow D_i}(H^{\uparrow D_i})]_b. \tag{14}$$

Therefore, a node $N_i$ such that $D_h \subseteq D_i$ must be chosen if
$[\varphi^{\downarrow D_h}(H)]_b$ has to be computed for a set
$H \subseteq \Theta_{D_h}$.
3.3 Binary Join Trees

The Shenoy-Shafer architecture as presented above is not very efficient for
Markov trees where nodes have more than three neighbors. During outward
propagation, combinations would be recomputed unnecessarily. This problem can
be solved by using binary join trees [13] where each node has at most 3 neighbor
nodes. Binary join trees minimize the number of combinations needed by saving
intermediate results during propagation. If, for example, a node has $n$
neighbor nodes, it would need $n \cdot (n - 1)$ combinations to generate all
messages of this node. In contrast, a corresponding binary join tree would only
need $3n - 4$ combinations. Fortunately, every Markov tree can be transformed
into a binary join tree by introducing new nodes. For this purpose, for every
node with $n > 3$ neighbor nodes it is sufficient to create $n - 3$ new nodes.



4 An Alternative to Outward Propagation 

Inward propagation followed by an outward propagation corresponds in a way to
a complete compilation of the given knowledge, after which user queries can be
answered very quickly. In the following we propose a method which corresponds
only to a partial compilation of the given knowledge. This partial compilation
results from the inward propagation phase where intermediate results are stored.
Later, instead of performing a complete outward propagation phase, a partial
inward propagation is performed for every query.



4.1 The Basic Idea 



To understand the basic idea of the method presented in this paper, it is better
to set the Shenoy-Shafer architecture aside for a moment. Suppose the potential
$\varphi$ has the domain $d(\varphi) = D$ and suppose $H \subseteq \Theta_D$ is
given. Then, according to Equation 4,

$$[\varphi(H)]_B = \frac{[\varphi(H)]_b - [\varphi(\emptyset)]_b}{1 - [\varphi(\emptyset)]_b}. \tag{15}$$



As shown by the following theorem, there is another way to compute the same 
result: 



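Theorem 1. Suppose $\varphi$ has domain $d(\varphi) = D$ and suppose
$H \subseteq \Theta_D$ is given. If $v$ is such that $[v(H^c)]_m = 1$ and
$d(v) = D$, then

$$[\varphi(H)]_B = \frac{[(\varphi \otimes v)(\emptyset)]_m - [\varphi(\emptyset)]_m}{1 - [\varphi(\emptyset)]_m}. \tag{16}$$

Proof. It is always $[\varphi(\emptyset)]_b = [\varphi(\emptyset)]_m$. In addition

$$[\varphi(H)]_b = \sum_{A \subseteq H} [\varphi(A)]_m = \sum_{A \cap H^c = \emptyset} [\varphi(A)]_m = \sum_{A \cap H^c = \emptyset} [\varphi(A)]_m \cdot [v(H^c)]_m = [(\varphi \otimes v)(\emptyset)]_m. \tag{17}$$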
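Theorem 1 can be checked numerically on a small example. The sketch below assumes that both potentials live on the same domain and uses a hypothetical frame {a, b, c}; the helper names `combine` and `belief` and all numeric values are illustrative assumptions.

```python
from collections import defaultdict

def combine(m1, m2):
    """Unnormalized combination on a common frame (intersect focal sets)."""
    result = defaultdict(float)
    for b1, v1 in m1.items():
        for b2, v2 in m2.items():
            result[b1 & b2] += v1 * v2
    return dict(result)

def belief(mass, subset):
    """Unnormalized belief: total mass of all sets contained in `subset`."""
    return sum(value for focal, value in mass.items() if focal <= subset)

frame = frozenset("abc")
phi = {frozenset(): 0.1, frozenset("a"): 0.3, frozenset("ab"): 0.4, frame: 0.2}
H = frozenset("ab")
v = {frame - H: 1.0}                      # single focal set: the complement of H

c1 = phi.get(frozenset(), 0.0)            # mass of the empty set in phi
c2 = combine(phi, v).get(frozenset(), 0.0)  # mass of the empty set after adjoining v
via_theorem = (c2 - c1) / (1.0 - c1)

b_empty = belief(phi, frozenset())
via_definition = (belief(phi, H) - b_empty) / (1.0 - b_empty)
```

Every focal set of `phi` that is contained in H has an empty intersection with the complement of H, so its mass moves to the empty set; this is exactly the mass that the belief of H adds up.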
Therefore, the values $c_1 = [\varphi(\emptyset)]_m$ and
$c_2 = [(\varphi \otimes v)(\emptyset)]_m$ are sufficient for calculating
$[\varphi(H)]_B$.

Our method tries to compute these two values as fast as possible, which is why
we use the Shenoy-Shafer architecture. $c_1$ is computed after the first inward
propagation phase. If intermediate results are stored on the nodes during inward
propagation, then only a partial inward propagation is needed for $c_2$.

4.2 Restrictions on Queries 

There is a price we have to pay when the Shenoy-Shafer architecture is used:
$[\varphi^{\downarrow D_h}(H)]_B$ (which is equal to
$[\varphi(H^{\uparrow D})]_B$) can only be computed for sets
$H \subseteq \Theta_{D_h}$ when the underlying Markov tree contains a node $N_i$
with domain $D_i$ such that $D_h \subseteq D_i$. If this is not the case, then a
new Markov tree has to be built.

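There is a second restriction which may be more severe: our method is not
able to compute queries of the form $[\varphi^{\downarrow D_h}(H)]_m$ and
$[\varphi^{\downarrow D_h}(H)]_M$. For a given set $H \subseteq \Theta_{D_h}$,
our method computes only the value $[\varphi^{\downarrow D_h}(H)]_B$, whereas
traditional methods compute the entire mass function
$[\varphi^{\downarrow D_h}]_m$.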

4.3 Inward Propagation Phase 

Prior to the inward propagation phase a root node Nr has to be selected. In 
practice, every node could be selected, but it is better to select a node on which 
some of the queries can be answered. Often, the first query is used to determine 
the root node. 

After a root node $N_r$ is designated, an inward propagation towards $N_r$ is
started. Every node $N_{k_0}$ with neighbor nodes $N_{k_1}, \ldots, N_{k_n}$,
where $N_{k_n}$ denotes the node on the path to the root node, then computes

$$\varphi'_{k_0} = \varphi_{k_0} \otimes \varphi_{k_1 k_0} \otimes \cdots \otimes \varphi_{k_{n-1} k_0}. \tag{18}$$

The message $\varphi_{k_0 k_n}$ is computed by an additional marginalization
followed by an extension as

$$\varphi_{k_0 k_n} = \left( (\varphi'_{k_0})^{\downarrow D_{k_0} \cap D_{k_n}} \right)^{\uparrow D_{k_n}}. \tag{19}$$

The inward propagation phase is finished as soon as $N_r$ has combined its
potential with every incoming message, yielding $\varphi'_r$, which is equal to
$\varphi^{\downarrow D_r}$. The value $c_1 = [\varphi^{\downarrow D_r}(\emptyset)]_m$
is of particular interest since it is used afterwards for normalization.
Therefore, this value has to be stored.

4.4 Partial Inward Propagation 

Suppose that for $H \subseteq \Theta_{D_h}$ the query
$[\varphi^{\downarrow D_h}(H)]_B$ has to be answered. For this purpose, first a
node $N_i$ with domain $D_i$ such that $D_h \subseteq D_i$ has to be selected.
If there is no node with this property, then a new Markov tree has to be built.
Often, however, there are several nodes $N_i$ such that $D_h \subseteq D_i$.
Then it is best to choose the one which is closest to the root node $N_r$.
Now, a new potential $v$ with $d(v) = D_h$ and $[v(H^c)]_m = 1$ has to be
constructed and adjoined to the node $N_i$. The idea behind $v$ is derived from
propositional logic, where for a set of sentences $S$ and another sentence $h$

$$S \models h \iff S, \neg h \models \bot. \tag{20}$$

A partial inward propagation is then initiated. Note that only the nodes on
the path from $N_i$ to $N_r$ have to perform computations if intermediate
results are stored during the first inward propagation phase. The partial inward
propagation is finished as soon as the root node $N_r$ has received a message
and has computed $\varphi''_r$. It is then

$$\varphi''_r = (\varphi \otimes v)^{\downarrow D_r}. \tag{21}$$

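Of particular interest is the value
$[(\varphi \otimes v)^{\downarrow D_r}(\emptyset)]_m$ because it is used to
compute $[\varphi^{\downarrow D_h}(H)]_B$. If we let $c_1$ denote
$[\varphi^{\downarrow D_r}(\emptyset)]_m$ obtained from the first inward
propagation phase and $c_2 = [(\varphi \otimes v)^{\downarrow D_r}(\emptyset)]_m$
from the partial inward propagation, then the following theorem holds:

Theorem 2. Suppose $H \subseteq \Theta_{D_h}$ with $D_h \subseteq D_i$ for some
node $N_i$. In addition, suppose that $c_1$ and $c_2$ are defined as explained
above. Then

$$[\varphi^{\downarrow D_h}(H)]_B = \frac{c_2 - c_1}{1 - c_1}. \tag{22}$$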

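Proof. According to Subsection 4.1 and if $v$ is such that $[v(H^c)]_m = 1$ and
$d(v) = D_h$, then

$$[\varphi^{\downarrow D_h}(H)]_B = \frac{[(\varphi^{\downarrow D_h} \otimes v)(\emptyset)]_m - [\varphi^{\downarrow D_h}(\emptyset)]_m}{1 - [\varphi^{\downarrow D_h}(\emptyset)]_m} \tag{23}$$

$$= \frac{[(\varphi \otimes v)^{\downarrow D_h}(\emptyset)]_m - [\varphi^{\downarrow D_h}(\emptyset)]_m}{1 - [\varphi^{\downarrow D_h}(\emptyset)]_m} \tag{24}$$

$$= \frac{[(\varphi \otimes v)(\emptyset)]_m - [\varphi(\emptyset)]_m}{1 - [\varphi(\emptyset)]_m} \tag{25}$$

$$= \frac{[(\varphi \otimes v)^{\downarrow D_r}(\emptyset)]_m - [\varphi^{\downarrow D_r}(\emptyset)]_m}{1 - [\varphi^{\downarrow D_r}(\emptyset)]_m} \tag{26}$$

$$= \frac{c_2 - c_1}{1 - c_1}. \tag{27}$$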



Partial inward propagation is needed to be able to compute
$[(\varphi \otimes v)^{\downarrow D_r}(\emptyset)]_m$ on the root node $N_r$. If
no intermediate results had been stored during the first inward propagation
phase, then a complete inward propagation would be necessary to compute
$[(\varphi \otimes v)^{\downarrow D_r}(\emptyset)]_m$. Note that partial inward
propagation is quite fast since the newly constructed potential $v$ contains
only one focal set. Therefore, adjoining $v$ to the node $N_i$ generally
simplifies computations.
4.5 The Algorithm

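Suppose a Markov tree with nodes $N_1, \ldots, N_\ell$ is given where each node
$N_i$ contains a potential $\varphi_i$, and let
$\varphi := \varphi_1 \otimes \cdots \otimes \varphi_\ell$ denote the global
potential. For each $H_k$ in $\{H_1, \ldots, H_m\}$, with
$H_k \subseteq \Theta_{D_{h_k}}$ and $D_{h_k} \subseteq D_i$ for some node $N_i$,
the query $[\varphi^{\downarrow D_{h_k}}(H_k)]_B$ has to be answered. The
complete algorithm is then given as follows:

    Select root node $N_r$
    Inward Propagation: $\varphi'_r = \varphi^{\downarrow D_r}$
    Define $c_1 := [\varphi^{\downarrow D_r}(\emptyset)]_m$
    If $c_1 = 1$ then Exit
    For each $H_k$ in $\{H_1, \ldots, H_m\}$ do
        Select node $N_i$ such that $D_{h_k} \subseteq D_i$
        Build $v$ such that $d(v) = D_{h_k}$ and $[v(H_k^c)]_m = 1$
        Adjoin $v$ on node $N_i$
        Partial Inward Propagation: $\varphi''_r = (\varphi \otimes v)^{\downarrow D_r}$
        Define $c_2 := [(\varphi \otimes v)^{\downarrow D_r}(\emptyset)]_m$
        Output $[\varphi^{\downarrow D_{h_k}}(H_k)]_B = \frac{c_2 - c_1}{1 - c_1}$
    Next $H_k$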

Note that if $c_1 = 1$ is obtained then the given knowledge is completely
contradictory. In this case there is nothing to do since $c_2$ would also be
equal to 1. Otherwise each query can be answered by a partial inward
propagation. Note that when a node $N_i$ has to be selected for a query it is
best to choose the node which is closest to the root node $N_r$.

5 Some Improvements 

5.1 Improving the Inward Propagation 

In the previous section it has been shown that an inward propagation phase 
followed by another partial inward propagation is sufficient to answer a query. 
In order to answer queries as fast as possible it is therefore important that 
inward propagation does not need too much time. In this subsection an algorithm 
is presented which minimizes the number of combinations needed by storing 
intermediate results during the first inward propagation phase. In addition the 
algorithm tries to determine a good sequence of combinations in order to speed 
up inward propagation. 

Suppose $N_{k_0}$ has neighbor nodes $N_{k_1}, \ldots, N_{k_n}$ and suppose
$N_{k_n}$ is the node towards the root node (if $N_{k_0}$ is itself the root
node, then a new root node connected only to $N_{k_0}$ has to be virtually
inserted). During inward propagation $N_{k_0}$ has to compute
$\varphi'_{k_0} = \varphi_{k_0} \otimes \varphi_{k_1 k_0} \otimes \cdots \otimes \varphi_{k_{n-1} k_0}$.
Because of associativity and commutativity there are many ways to compute this.
Although the final result will always be the same, there may be big differences
in the time needed. Because the time needed to combine two potentials is
correlated with the number of focal sets, it seems natural to first combine
potentials possessing fewer focal sets. This heuristic is used in the following
algorithm:
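    $\Psi := \{\varphi_{k_0}\} \cup \{\varphi_{k_i k_0} : 1 \le i < n\}$
    loop
        choose $\vartheta_k \in \Psi$ and $\vartheta_l \in \Psi \setminus \{\vartheta_k\}$ such that
            $|FS(\vartheta_k)| \le |FS(\zeta)|$ for all $\zeta \in \Psi$ and
            $|FS(\vartheta_l)| \le |FS(\zeta)|$ for all $\zeta \in \Psi \setminus \{\vartheta_k\}$
        $\Psi := \Psi \cup \{\vartheta_k \otimes \vartheta_l\} \setminus \{\vartheta_k, \vartheta_l\}$
    until $|\Psi| = 1$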
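The heuristic can be sketched in a few lines. This is an illustration under simplifying assumptions: all potentials are taken to live on the same domain, mass functions are dictionaries without zero entries (so `len` counts focal sets), and the names `combine` and `combine_fewest_first` as well as the example potentials are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def combine(m1, m2):
    """Unnormalized combination on a common frame (intersect focal sets)."""
    result = defaultdict(float)
    for b1, v1 in m1.items():
        for b2, v2 in m2.items():
            result[b1 & b2] += v1 * v2
    return dict(result)

def combine_fewest_first(potentials):
    """Repeatedly combine the two potentials with the fewest focal sets."""
    pool = list(potentials)
    while len(pool) > 1:
        pool.sort(key=len)                      # len(dict) = number of focal sets
        first, second = pool[0], pool[1]
        pool = pool[2:] + [combine(first, second)]
    return pool[0]

frame = frozenset("abc")
p1 = {frozenset("a"): 0.5, frozenset("ab"): 0.3, frame: 0.2}   # 3 focal sets
p2 = {frozenset("bc"): 1.0}                                      # 1 focal set
p3 = {frozenset("ab"): 0.6, frame: 0.4}                          # 2 focal sets

result = combine_fewest_first([p1, p2, p3])     # combines p2 with p3 first
baseline = reduce(combine, [p1, p2, p3])        # plain left-to-right order
```

Both orders produce the same potential, as associativity and commutativity require; only the sizes of the intermediate results differ. Re-sorting the pool in every round is adequate for the handful of messages a node receives; a priority queue would be the natural replacement for larger pools.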

Every node uses this algorithm to combine its incoming messages. At the end
the set $\Psi$ contains only the potential $\varphi'_{k_0}$. If $N_{k_0}$ is the
root node, then $\varphi'_{k_0}$ is equal to $\varphi^{\downarrow D_{k_0}}$;
otherwise the message $\varphi_{k_0 k_n}$ to the inward neighbor $N_{k_n}$ can
easily be computed using Equation 19 by performing an additional marginalization
followed by an extension.

The algorithm can be visualized by creating a new node for every combination.
If $N_{k_0}$ has $n > 1$ neighbor nodes, then exactly $n - 1$ new nodes with
domain $D_{k_0}$ are created. Each of these newly created nodes serves to store
the corresponding intermediate result. As an example, Figure 1 shows two trees
for a given node with 4 neighbor nodes. The tree on the left corresponds to the
computation of
$(((\varphi_{k_0} \otimes \varphi_{k_1 k_0}) \otimes \varphi_{k_2 k_0}) \otimes \varphi_{k_3 k_0})$
whereas the one on the right corresponds to
$((\varphi_{k_0} \otimes \varphi_{k_1 k_0}) \otimes (\varphi_{k_2 k_0} \otimes \varphi_{k_3 k_0}))$.




Fig. 1. Visualizing the Algorithm. 



The method presented here corresponds to the technique of binary join trees
because both methods minimize the number of combinations needed by storing
intermediate results during inward propagation. Here, however, the nodes of the
Markov tree are not restricted to at most 3 neighbor nodes. Since combination
is a binary operation, the visualization of the algorithm nevertheless always
leads to a binary join tree.
In [17] another method is presented which minimizes the number of combinations
by saving intermediate results during propagation. All messages from the
outward neighbors of $N_{k_0}$ are combined accumulatively with the potential of
$N_{k_0}$ itself. In Figure 1 the binary join tree on the left where the newly
created nodes
are on a line corresponds to this method. If $n$ denotes the number of neighbor
nodes of $N_{k_0}$ then there are $\prod_{i=2}^{n} \binom{i}{2}$ possible
connections. Only one of them corresponds to the method presented in [17],
whereas our method tries to select a good connection depending on
$\varphi_{k_0}$ and the incoming messages $\varphi_{k_i k_0}$.

5.2 Selection of nodes 

In the algorithm presented in Subsection 4.5 a node $N_i$ has to be selected to
which a newly constructed potential $v$ is adjoined. As already mentioned, there
may be many nodes $N_i$ which could be used for this purpose, but it is best to
select the one which is closest to the root node $N_r$. When the heuristic
presented in the previous subsection is used during the first inward propagation
phase, node $N_i$ has stored some intermediate results, each of them with the
domain $D_i$. To adjoin $v$ it is once again best to select the intermediate
result which is closest to the root node. In this manner the path to the root
node is as short as possible.

6 Conclusion 

We have shown that, given a Markov tree, a query can be answered by an inward
propagation followed by another partial inward propagation instead of an inward
propagation followed by an outward propagation. In order to answer queries as
fast as possible it is therefore important to speed up inward propagation.
Inward propagation can be improved by storing intermediate results during the
first inward propagation phase and by using a heuristic which first combines
potentials possessing fewer focal sets.
Abstract. In default logic, possible sets of conclusions from a default
theory are given in terms of extensions of that theory. Each such extension
is generated through a set of default rules. In this paper, we are
concerned with identifying default rules belonging to all sets of default
rules generating different extensions. This is interesting from several
perspectives. First, it allows for approximating the set of so-called skeptical
conclusions of a default theory, that is, those conclusions belonging to
all extensions. Second, it provides a technique usable for pre-processing
default theories, because such default rules are applicable without knowing
or altering the extensions of the initial theory. The fact that our
technique leaves the resulting conclusions unaffected makes it applicable as a
universal pre-processing tool for all sorts of computational tasks.



1 Introduction 

Default logic [6] is one of the best known and most widely studied formaliza- 
tions of default reasoning due to its very expressive and lucid language. In default 
logic, knowledge is represented as a default theory, which consists of a set of for- 
mulas and a set of default rules for representing default information. Possible 
sets of conclusions from a default theory are given in terms of extensions of that 
theory. A default theory can possess zero, one, or multiple extensions because 
different ways of resolving conflicts among default rules lead to different alterna- 
tive extensions. In fact, any such extension is uniquely characterizable through 
its so-called set of generating default rules. 

In this paper, we are concerned with identifying default rules belonging to
all sets of generating default rules. This is interesting from several
perspectives. First, it allows for approximating the set of so-called skeptical
conclusions of a default theory, that is, those conclusions belonging to all
extensions. Second, it provides a technique usable for pre-processing default
theories, because such default rules are applicable without knowing or altering
the extensions of the initial theory. The fact that our technique leaves the
resulting conclusions unaffected makes it applicable as a universal
pre-processing tool for all sorts of computational tasks, ranging from testing
the existence of extensions, through any sort of query-answering, up to the
computation of entire extensions.
For accomplishing our task, we draw on the notions of blocking sets and 
block graphs introduced in [4]. There, these concepts were used for characteriz- 
ing default theories guaranteeing the existence of extensions and for supporting 
query-answering. Here, we exploit the information gathered in block graphs for 
the aforementioned pre-processing task. 



2 Background 



A default rule is an expression of the form $\frac{\alpha : \beta}{\gamma}$. We
sometimes denote the prerequisite $\alpha$ of a default rule $\delta$ by
$p(\delta)$, its justification $\beta$ by $j(\delta)$ and its consequent
$\gamma$ by $c(\delta)$.^1 A set of default rules $D$ and a set of formulas $W$
form a default theory^2 $\Delta = (D, W)$, that may induce one, multiple or no
extensions:

Definition 1. [6] Let $\Delta = (D, W)$ be a default theory. For any set of
formulas $S$, let $\Gamma_\Delta(S)$ be the smallest set of formulas $S'$ such
that

1. $W \subseteq S'$,

2. $Th(S') = S'$,

3. For any $\frac{\alpha : \beta}{\gamma} \in D$, if $\alpha \in S'$ and
   $\neg\beta \notin S$ then $\gamma \in S'$.

A set of formulas $E$ is an extension of $\Delta$ iff $\Gamma_\Delta(E) = E$.

Observe that $E$ is a fixed-point of $\Gamma_\Delta$. Any such extension
represents a possible set of beliefs about the world.

For simplicity, we assume for the rest of the paper that default theories
$(D, W)$ comprise finite sets only. Additionally, we assume that for each
default rule $\delta$ in $D$, we have that $W \cup \{j(\delta)\}$ is consistent.
This can be done without loss of generality, because we can clearly eliminate
all rules $\delta$ from $D$ for which $W \cup \{j(\delta)\}$ is inconsistent,
without altering the set of extensions. We call two default theories extension
equivalent if they have the same extensions.

Consider the standard example where birds fly, birds have wings, penguins
are birds, and penguins don't fly, along with a formalization through default
theory $(D_1, W_1)$ where

$$D_1 = \left\{ \frac{b : \neg ab_b}{f},\ \frac{b : w}{w},\ \frac{p : b}{b},\ \frac{p : \neg ab_p}{\neg f} \right\} \tag{1}$$

and $W_1 = \{\neg f \to ab_b,\ f \to ab_p,\ p\}$. We let $\delta_f$, $\delta_w$,
$\delta_b$, $\delta_{\neg f}$ abbreviate the previous default rules by appeal to
their consequents. Our example yields two extensions, viz.
$E_1 = Th(W_1 \cup \{b, w, \neg f\})$ and $E_2 = Th(W_1 \cup \{b, w, f\})$,
while theory $(D_1 \cup \{\frac{: x}{\neg x}\}, W_1)$ has no extension. Adding
$\frac{: ab_b}{ab_b}$ to (1) eliminates extension $E_2$.

Further, define for a set of formulas $S$ and a set of defaults $D$ the set of
generating default rules as
$GD(D, S) = \{\delta \in D \mid S \vdash p(\delta) \text{ and } S \nvdash \neg j(\delta)\}$.
We see that the two extensions of (1) are generated by
$GD(D_1, E_1) = \{\delta_w, \delta_b, \delta_{\neg f}\}$ and
$GD(D_1, E_2) = \{\delta_f, \delta_w, \delta_b\}$, respectively. Following [7],
we call a set of default rules $D$ grounded in a set of formulas $S$ iff there
exists an enumeration $\langle \delta_i \rangle_{i \in I}$ of $D$ such that we
have for all $i \in I$ that
$S \cup c(\{\delta_0, \ldots, \delta_{i-1}\}) \vdash p(\delta_i)$. Note that
$E = Th(W \cup c(GD(D, E)))$ forms an extension of $(D, W)$ if $GD(D, E)$ is
grounded in $W$. As proposed by [3,1], we call a set of default rules $D$ weakly
regular wrt a set of formulas $S$ iff we have for each $\delta \in D$ that
$S \cup c(D) \nvdash \neg j(\delta)$.^3 A default logic is said to enjoy one of
these properties according to its treatment of default rules that generate
extensions. Reiter's default logic leads to grounded and weakly regular sets of
generating default rules.

^1 This generalizes to sets of default rules in the obvious way.

^2 If clear from the context, we sometimes refer with $\Delta$ to $(D, W)$ or
its components $D$ and $W$ (and vice versa) without mention.

For capturing the interaction between default rules under weak regularity,
[4] introduced the concept of blocking sets:

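Definition 2. [4] Let $\Delta = (D, W)$ be a default theory. For $\delta \in D$
and $B \subseteq D$, we define

$B$ as a potential blocking set of $\delta$, written $B \mapsto_\Delta \delta$,
iff

- $W \cup c(B) \vdash \neg j(\delta)$ and

- $B$ is grounded in $W$.

$B$ is an essential blocking set of $\delta$, written
$B \longmapsto_\Delta \delta$, iff

- $B \mapsto_\Delta \delta$ and

- $(B \setminus \{\delta'\}) \mapsto_\Delta \delta''$ for no $\delta' \in B$ and
  no $\delta'' \in B \cup \{\delta\}$.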

Let $B_\Delta(\delta) = \{B \mid B \longmapsto_\Delta \delta\}$ be the set of
all essential blocking sets of $\delta$. These blocking sets provide candidate
sets for denying the application of $\delta$. We give the sets of blocking sets
obtained in our example at the end of this section.
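Groundedness and potential blocking can be tested mechanically on small theories. The sketch below decides entailment by a brute-force truth table and represents formulas as Python functions over a truth assignment. It assumes that the four rules of theory (1) are b : ¬ab_b / f, b : w / w, p : b / b and p : ¬ab_p / ¬f with facts ¬f → ab_b, f → ab_p and p. All identifiers (`entails`, `grounded`, `potentially_blocks`, `d_f`, ...) are hypothetical, and the minimality test that distinguishes essential blocking sets is not implemented.

```python
from collections import namedtuple
from itertools import product

ATOMS = ("p", "b", "w", "f", "ab_b", "ab_p")
Default = namedtuple("Default", "name pre just cons")   # pre : just / cons

def entails(premises, goal):
    """Truth-table entailment: every model of the premises satisfies the goal."""
    for values in product((False, True), repeat=len(ATOMS)):
        env = dict(zip(ATOMS, values))
        if all(f(env) for f in premises) and not goal(env):
            return False
    return True

def grounded(defaults, facts):
    """Greedily apply rules whose prerequisite is derivable; succeed if all fire."""
    derived, pending = list(facts), list(defaults)
    progress = True
    while pending and progress:
        progress = False
        for d in list(pending):
            if entails(derived, d.pre):
                derived.append(d.cons)
                pending.remove(d)
                progress = True
    return not pending

def potentially_blocks(block, target, facts):
    """Facts plus consequents of `block` refute j(target), and `block` is grounded."""
    premises = list(facts) + [d.cons for d in block]
    refuted = entails(premises, lambda e: not target.just(e))
    return refuted and grounded(block, facts)

W1 = [lambda e: e["f"] or e["ab_b"],          # not f -> ab_b
      lambda e: (not e["f"]) or e["ab_p"],    # f -> ab_p
      lambda e: e["p"]]
d_f = Default("d_f", lambda e: e["b"], lambda e: not e["ab_b"], lambda e: e["f"])
d_w = Default("d_w", lambda e: e["b"], lambda e: e["w"], lambda e: e["w"])
d_b = Default("d_b", lambda e: e["p"], lambda e: e["b"], lambda e: e["b"])
d_nf = Default("d_nf", lambda e: e["p"], lambda e: not e["ab_p"], lambda e: not e["f"])

ok_pair = grounded([d_w, d_b], W1)                      # d_b fires first, then d_w
ok_single = grounded([d_w], W1)                         # b is not derivable from W1
blocks_f = potentially_blocks([d_nf], d_f, W1)
blocks_nf = potentially_blocks([d_b, d_f], d_nf, W1)
ungrounded = potentially_blocks([d_f], d_nf, W1)        # entails ab_p, not grounded
```

The last check shows why the rule deriving b has to be part of a blocking set for the rule concluding ¬f: without it, the blocking rule's prerequisite cannot be derived from the facts. The truth table has 2^6 rows here; this brute-force test is only meant for toy theories.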

In what follows, we let the term blocking set refer to essential blocking sets.
This is justified by a result in [5], showing that essential blocking sets are
indeed sufficient for characterizing the notion of consistency used in Reiter's
default logic:^4 For a set $D' \subseteq D$ grounded in $W$, we have that $D'$
is weakly regular wrt $W$ iff we have for each $\delta' \in D'$ and each
$B' \subset D'$ that $B' \notin B_\Delta(\delta')$.

The problem with blocking sets is that there may be exponentially many in
the worst case. This is why [4] put forward the notion of a block graph, as a
compact abstraction of actual blocking sets.

Definition 3. [4] Let $\Delta = (D, W)$ be a default theory. The block graph
$G(\Delta) = (V_\Delta, A_\Delta)$ of $\Delta$ is a directed graph with vertices
$V_\Delta = D$ and arcs

$$A_\Delta = \{(\delta', \delta) \mid \delta' \in B \text{ for some } B \in B_\Delta(\delta)\}.$$

(Recall that a directed graph $G$ is a pair $G = (V, A)$ such that $V$ is a
finite, non-empty set of vertices and $A \subseteq V \times V$ is a set of
arcs.) We observe that the space complexity of block graphs is quadratic in the
number of default rules; its construction^5 faces the same time complexity as
the extension-membership-problem [2]; they are both in $\Sigma_2^P$. Note that
the efforts put into constructing a block graph are however meant to amortize
over subsequent tasks; notably its construction (and reduction, see operation
"|" below) are both incremental.

^3 Opposed to this, strongly regular stipulates
$S \cup c(D) \cup j(D) \nvdash \bot$.

^4 We let $B' \subset B$ stand for $B' \subseteq B$ and $B' \neq B$.

A crucial notion is that of a cycle in a block graph: A directed cycle in
$G = (V, A)$ is a finite subset $C \subseteq V$ such that
$C = \{v_1, \ldots, v_n\}$ and $(v_i, v_{i+1}) \in A$ for each $1 \le i < n$ and
$(v_n, v_1) \in A$. The length of a directed cycle in a graph is the total
number of arcs occurring in the cycle. Additionally, we call a cycle even (odd)
if its length is even (odd). Accordingly, a default theory is said to be
non-conflicting, well-ordered or even, depending on whether its block graph has
no arcs, no cycles or only even cycles, respectively. [4] show that these three
classes guarantee the existence of extensions.
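The three classes can be told apart by enumerating the cycles of a given block graph. The following sketch assumes the graph is already available as a list of vertices and a list of arcs; the function names and the vertex labels (`d_f`, `d_nf`, ...) are hypothetical, and the exhaustive search is only intended for small graphs.

```python
def cycle_lengths(vertices, arcs):
    """Lengths of all simple directed cycles, each cycle counted once."""
    order = {v: i for i, v in enumerate(vertices)}
    succ = {v: [] for v in vertices}
    for tail, head in arcs:
        succ[tail].append(head)
    lengths = []

    def dfs(start, node, visited):
        for nxt in succ[node]:
            if nxt == start:
                lengths.append(len(visited))        # closed a cycle
            elif nxt not in visited and order[nxt] > order[start]:
                dfs(start, nxt, visited | {nxt})    # only visit "later" vertices

    for s in vertices:
        dfs(s, s, frozenset([s]))
    return lengths

def classify(vertices, arcs):
    """Name the class of a theory from the shape of its block graph."""
    if not arcs:
        return "non-conflicting"
    lengths = cycle_lengths(vertices, arcs)
    if not lengths:
        return "well-ordered"
    if all(n % 2 == 0 for n in lengths):
        return "even"
    return "general"

V = ["d_f", "d_w", "d_b", "d_nf"]
A = [("d_nf", "d_f"), ("d_f", "d_nf"), ("d_b", "d_nf")]
kind = classify(V, A)                                   # one cycle of length 2
kind_loop = classify(V + ["d_nx"], A + [("d_nx", "d_nx")])  # adds a self-loop
```

A self-loop is a cycle of length 1 and therefore odd, which takes the theory out of the class for which extensions are guaranteed.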

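Default theory (1) yields the following blocking sets:

    $B_\Delta(\delta_f) = \{\{\delta_{\neg f}\}\}$
    $B_\Delta(\delta_w) = \emptyset$
    $B_\Delta(\delta_b) = \emptyset$
    $B_\Delta(\delta_{\neg f}) = \{\{\delta_b, \delta_f\}\}$

We get a block graph with vertex set $D_1$ (indicated by white nodes) and (solid)
arcs $(\delta_{\neg f}, \delta_f)$, $(\delta_f, \delta_{\neg f})$ and
$(\delta_b, \delta_{\neg f})$.

The addition of $\delta_{ab_b} = \frac{: ab_b}{ab_b}$ augments
$B_\Delta(\delta_f)$ by $\{\delta_{ab_b}\}$. We get additionally
$B_\Delta(\delta_{ab_b}) = \emptyset$, reflecting the fact that $\delta_{ab_b}$
is unblockable, that is, applicable without consistency check. The addition to
the block graph is indicated by (lightgray) node $\delta_{ab_b}$ and (dashed)
arc $(\delta_{ab_b}, \delta_f)$. The further addition of
$\delta_{\neg x} = \frac{: x}{\neg x}$ to (1) leaves the above blocking sets
unaffected and yields additionally
$B_\Delta(\delta_{\neg x}) = \{\{\delta_{\neg x}\}\}$, reflecting self blockage.
This leads to an additional (lightgray) node $\delta_{\neg x}$ and a (dotted)
odd loop $(\delta_{\neg x}, \delta_{\neg x})$ in the augmented block graph.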

Finally, we need the following technical instrument for (what we call)
shifting: Let $\Delta = (D, W)$ be a default theory and $D' \subseteq D$. We
define $\Delta|D'$ as $(D \setminus (D' \cup \overline{D'}),\ W \cup c(D'))$
where $\overline{D'} = \{\delta \in D \mid W \cup c(D') \vdash \neg j(\delta)\}$.
Intuitively, the shift operation $\Delta|D'$ results in a default theory
simulating the application of the rule set $D'$ to theory $\Delta$. The purpose
of $\overline{D'}$ is to eliminate defaults whose justification is inconsistent
with the facts of $\Delta|D'$. In fact, [5] shows that a set $E$ of formulas is
an extension of some default theory $\Delta$ iff it is an extension of
$\Delta|D'$, provided that $D' \subseteq GD(D, E)$ is grounded in $W$.

^5 That is, a corresponding decision problem.
3 Pre-processing

Our aim is now to exploit the information gathered in a block graph for iden- 
tifying default rules belonging to all sets of generating default rules. We begin 
with the formal foundations for this endeavor. 

Definition 4. Let $\Delta = (D, W)$ be a default theory and $D' \subseteq D$.

We define $D' \subseteq D$ as active in $\Delta$ iff $D'$ is grounded in $W$ and
for all $\delta' \in D'$ there is no $\delta \in D$ such that
$(\delta, \delta') \in A_\Delta$.

Active rules are thus applicable without consistency check since they possess
no blocking sets. The block graph of our exemplary theory $(D_1, W_1)$ indicates
two active sets: $\{\delta_b\}$ and $\{\delta_b, \delta_w\}$. While the addition
of $\delta_{\neg x}$ results in the same (putatively) active sets, the addition
of $\delta_{ab_b}$ yields additionally $\{\delta_{ab_b}\}$,
$\{\delta_b, \delta_{ab_b}\}$, and $\{\delta_b, \delta_w, \delta_{ab_b}\}$. It
should be clear that the union of all active sets is also an active set.
Activeness provides us with the following two results:

Theorem 1. Let $\Delta = (D, W)$ be a default theory and $D' \subseteq D$.

If $D'$ is an active set in $\Delta$, then $D' \subseteq GD(D, E)$ for each
extension $E$ of $\Delta$.

Provided that $D'$ is active in $\Delta$, we have that all consequences of
$W \cup c(D')$ do belong to all extensions of $\Delta$; they form a subset of
the so-called skeptical theorems of $\Delta$ (given by the intersection of all
extensions).

Now we may consider a bottom-up approach shifting all active default knowl- 
edge to the set of facts. This approach is backed up by the next theorem: 

Theorem 2. Let $\Delta = (D, W)$ be a default theory and $D' \subseteq D$.

If $D'$ is an active set in $\Delta$, then $\Delta$ is extension equivalent to
$\Delta|D'$.

The applications of this result are manifold. In particular, it leads also to a
reduction of the block graph: When computing the block graph $G(\Delta)$, we may
keep track of all active blocking sets. If we let $U_B$ denote their union, we
may then build $\Delta|U_B$, reduce $G(\Delta)$ to $G(\Delta|U_B)$ and perform
all subsequent tasks, like extension-construction or query-answering, relative
to $\Delta|U_B$ and $G(\Delta|U_B)$. We make this intuition precise in the
sequel.

First, we need the following definition. Let $G = (V, A)$ be a directed graph.
A node $v \in V$ is a source of $G$ if there is no $v'$ such that
$(v', v) \in A$, that is, there are no arcs pointing to sources. We denote the
set of all sources of a graph $G$ by $S_G$. The idea is now to repeatedly apply
the default rules indicated by the sources of the block graph. We capture the
single application of all such rules in the following way.
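Finding the sources of a block graph is a one-pass operation over its arcs. The sketch below assumes the graph is given explicitly; the function name `sources` and the vertex labels are hypothetical, and the arcs correspond to the running penguin example as read here.

```python
def sources(vertices, arcs):
    """Vertices without incoming arcs."""
    targets = {head for _, head in arcs}
    return {v for v in vertices if v not in targets}

V = ["d_f", "d_w", "d_b", "d_nf"]
A = [("d_nf", "d_f"), ("d_f", "d_nf"), ("d_b", "d_nf")]
base = sources(V, A)

# Adding an unblockable rule creates one more source ...
with_abb = sources(V + ["d_abb"], A + [("d_abb", "d_f")])
# ... whereas a self-blocking rule points to itself and is never a source.
with_loop = sources(V + ["d_nx"], A + [("d_nx", "d_nx")])
```

Being a source is only half of the condition used below: the rules at the sources still have to be grounded in the facts before they can be applied, and that test requires a theorem prover rather than a graph traversal.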

Definition 5. Let $\Delta$ be a default theory and $G(\Delta)$ its block graph.

We define $T(\Delta)$ as the maximal set $T(\Delta) \subseteq S_{G(\Delta)}$
being grounded in $W$.

Observe that $T(\Delta)$ is unique and, according to Definition 4, it is an
active set of default rules. With Definition 5 we are able to identify default
rules common to all sets of generating default rules:
Definition 6. Let $\Delta = (D, W)$ be a default theory.

Define $\Delta_0 = \Delta$, $C_0 = T(\Delta_0)$ and for $i \ge 0$

$$\Delta_{i+1} = \Delta_i | C_i \quad \text{and} \quad C_{i+1} = T(\Delta_{i+1}).$$

Then, $CD(D, W) = \bigcup_{i \ge 0} C_i$ is defined as the set of core default
rules of $\Delta$.

We note that $CD(D, W)$ is finite and well-defined, since $\Delta$ is supposed
to be finite. Also, observe the difference between $CD(D, W)$ and $GD(D, E)$:
While the latter depends on an existing extension, the former relies "merely" on
the initial theory (through its block graph). We put "merely" in quotes, since
the computation of a block graph is in the worst case as expensive as that of an
extension. The difference is however that an extension is an object of "on-line"
computations, while a block graph (as well as the set of core default rules) is
meant to be computed "off-line", so that the efforts put into its construction
amortize over subsequent tasks.^6

The idea is now to repeatedly extract the default rules indicated by the
sources of the block graph and to shift the consequents of the "groundable" ones
to the facts of the theory in focus. This process is repeated until all
applicable rules at the sources have been applied. For example, we get

$CD(D_1, W_1) = \{\delta_b, \delta_w\}$ and
$CD(D_1 \cup \{\delta_{ab_b}\}, W_1) = \{\delta_b, \delta_w, \delta_{ab_b}\}$.

We now elaborate upon the theories resulting from this process. To begin
with, we obtain the following corollary to Theorems 1 and 2:

Corollary 1. Let $\Delta = (D, W)$ be a default theory.

Then, $\Delta$ is extension equivalent to $\Delta|CD(D, W)$.

Shifting the core rules in our first two exemplary theories yields the following
ones:

$$(D_1, W_1)\,|\,CD(D_1, W_1) = (\{\delta_f, \delta_{\neg f}\},\ W_1 \cup \{b, w\})$$

$$(D_1 \cup \{\delta_{ab_b}\}, W_1)\,|\,CD(D_1 \cup \{\delta_{ab_b}\}, W_1) = (\{\delta_{\neg f}\},\ W_1 \cup \{b, w, ab_b\}).$$

While the first theory results in a cyclic block graph, the second one yields an
arcless, single-noded graph.

^6 Note also that there may be an exponential number of extensions, while there
is always a single block graph.
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According to the type of (initial) default theory (well-ordered, even or gen- 
eral) CD{D, W) plays a different role in skeptical default reasoning. 

Theorem 3. Let A be a well-ordered default theory. 

Then, CD(D,W) = GD(D,E) for the (single) extension E of A. 

Clearly, this applies also to non-conflicting theories. Otherwise, we have: 

Theorem 4. Let A be a default theory. 

Then, we have $CD(D,W) \subseteq GD(D,E)$ for each extension E of A. 

Observe that this result covers the case where A has no extension; 
extensions are guaranteed for even theories (cf. Section 2). In order to actually use 
CD(D,W) for skeptical reasoning, however, one has to ensure that A|CD(D,W) 
has an extension, because there may be remaining (destructive) odd cycles in the 
block graph G(A|CD(D,W)) which kill all possible extensions of A|CD(D,W) 
(and with it all extensions of A). For instance, (T>i U {S^x},Wi) results in 
GD{Di U {S^x},Wi) = although it has no extenions. 

Next, we therefore elaborate upon the structure of A|CD(D,W). 

Theorem 5. Let A be a default theory. 

If G(A|CD(D,W)) is acyclic, then G(A|CD(D,W)) is arcless. 

That is, shifting the set of core rules yields block graphs having either no arcs 
at all or at least one cycle. While the first case amounts to computing the 
single extension of the underlying theory, the cycle(s) obtained in the second 
case indicate the different choices that lead to multiple putative extensions. The 
former can be made precise as follows. 

Corollary 2. Let A be a default theory. 

If G(A|CD(D,W)) is acyclic, then A has a single extension. 

Existing cycles in G{A\GD{D, W)) may however still lead to none, one or multi- 
ple extensions. For this take TF = 0 and in turn D = D = {^, 

and D = {^, ^}, respectively. 

Clearly, if A is well-ordered, then G(A|CD(D,W)) is arcless. The next 
theorem shows what happens if A is even. 

Theorem 6. If A is an even default theory, then G(A|CD(D,W)) is either even 
or arcless. 

In both cases, A|CD(D,W) is thus also guaranteed to have some extension. 

Finally, we obtain the following recursive procedure for computing the set of 
core default rules: 

Procedure core( A = (D,W) : Default theory ) 

1. construct the block graph G(A) of A 

2. compute T(A) from G(A) 

3. if T(A) = ∅ then return ∅ 

else return core(A|T(A)) ∪ T(A). 
The next theorem shows that procedure core computes CD(D,W): 

Theorem 7. If A is a default theory, then CD(D,W) = core(A). 
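
The recursion of procedure core can be sketched in a few lines of Python. This is only an illustration of the control flow: the functions `toy_sources` and `toy_shift` are hypothetical stand-ins (not the paper's definitions) that treat a theory as a plain directed graph, take the vertices without incoming arcs as the selected rules, and model shifting as vertex deletion. Groundedness and blocked rules are ignored.

```python
def core(theory, sources, shift):
    """core(A) = {} if the selected rule set is empty,
    otherwise core(A shifted by the selected rules) united with them."""
    selected = sources(theory)
    if not selected:
        return frozenset()
    return core(shift(theory, selected), sources, shift) | selected


# Hypothetical toy stand-ins: a "theory" is a dict vertex -> successor set.
def toy_sources(graph):
    # Vertices that no arc points to.
    targets = set().union(*graph.values())
    return frozenset(v for v in graph if v not in targets)


def toy_shift(graph, rules):
    # Delete the selected vertices together with their arcs.
    return {v: succ - rules for v, succ in graph.items() if v not in rules}


# Two isolated rules and one two-element cycle, loosely mirroring the
# running example (names are illustrative only).
toy_graph = {
    "d_b": set(),
    "d_w": set(),
    "d_f": {"d_not_f"},
    "d_not_f": {"d_f"},
}
toy_core = core(toy_graph, toy_sources, toy_shift)
```

On this toy graph the recursion stops as soon as only the cycle is left, which is the situation addressed in the next section.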

Although the computation of CD(D,W) is meant as an off-line compilation 
process, one may be reluctant to repeatedly compute the block graph at each 
recursive level. There are two constructive answers to this objection: First, $G(A|\{\delta\})$ 
is computable from G(A) by deleting arcs starting at defaults belonging to the 
same blocking sets as $\delta$. These are indicated by the arcs in the block graph. 
Second, one may approximate at each level the exact block graph G(A|D') by 
simply deleting vertices $D' \cup \overline{D'}$ (recall that $\overline{D'} = \{\delta \in D \mid W \cup c(D') \vdash \neg \ldots\}$) 
and their connecting arcs in the block graph of the original theory. Denoting 
this modified procedure by core'(A), it is easy to show that $core'(A) \subseteq core(A)$, 
because G(A|D') has the same vertices but fewer arcs and thus more sources than 
the graph obtained by simply deleting vertices $D' \cup \overline{D'}$. 

4 Breaking even cycles 

The idea put forward in the last section was to pre-process a default theory 
by starting at the outer nodes, namely the sources, of the block graph and by 
subsequently working inwards until we encounter either a cycle or no further 
arcs. While the latter case results in an extension, the former one may leave 
rather large parts of the block graph unexplored. This section 
furnishes a method for breaking up certain even cycles in order to make their 
successor nodes accessible to the pre-processing approach given in the previous 
section. 

We need the following graph theoretical concepts. For a directed graph 
$G = (V,A)$ and vertex $v \in V$, define the reachable predecessors of $v$ as 
$\gamma(v) = \bigcup_{i \geq 0} \gamma^i(v)$ where $\gamma^0(v) = \{v\}$ and 
$\gamma^i(v) = \{u \mid (u,w) \in A \text{ and } w \in \gamma^{i-1}(v)\}$ for 
$i \geq 1$. Then, for subset $S \subseteq V$, define $\gamma(S) = \bigcup_{v \in S} \gamma(v)$. Furthermore, we need: 

Definition 7. Let $G = (V,A)$ be a directed graph and $C \subseteq V$ a cycle in $G$. $C$ 
is a dominating cycle in $G$ iff $C \subseteq \gamma(C)$ is the only cycle in the subgraph of $G$ 
which is induced by nodes $\gamma(C)$. 

That is, there is no other cycle between the predecessors of C; C is thus a cycle 
placed at the “outskirts” of the graph. For instance, $\{\delta_f, \delta_{\neg f}\}$ is a dominating 
cycle in the block graph of $(D_1, W_1)$. Observe that such cycles do not necessarily 
exist, since cycles may themselves depend upon each other in a cyclic way. 

Even (dominating) cycles are decomposable in a natural way. 

Definition 8. Let $G = (V,A)$ be a directed graph and $C = \{v_1, \ldots, v_k\}$ an 
even cycle in $G$. We define the canonical partition of $C$ as $(C_1, C_2)$ where 
$C_1 = \{v_1, v_3, \ldots, v_{2n-1}\}$ and $C_2 = \{v_2, v_4, \ldots, v_{2n}\}$ for $2n = k$. 

By definition, we have $C_1 \cap C_2 = \emptyset$ because $C_1$ contains all nodes with odd 
indices and $C_2$ contains all nodes with even indices. The canonical partition of 
the even cycle in the block graph of $(D_1, W_1)$ is $(\{\delta_{\neg f}\}, \{\delta_f\})$. 
Interestingly, each part of such a partition induces a different extension of 
the underlying theory: 

Theorem 8. Let A be an even default theory such that G(A|CD(D,W)) 
contains a dominating cycle C and let $(C_1, C_2)$ be the canonical partition of C. 

Then $C_1 \subseteq GD(D,E_1)$ and $C_2 \subseteq GD(D,E_2)$ for different extensions $E_1$ and 
$E_2$ of A. 

In our example, $\delta_{\neg f}$ belongs to $GD(D,E_1) = \{\delta_{\neg f}, \delta_w, \delta_b\}$, while $\delta_f$ gives rise to 
$GD(D,E_2) = \{\delta_f, \delta_w, \delta_b\}$. 

This can be put together into the following procedure: 

Procedure core+( A = (D,W) : Default theory ) 

1. Let CD = core(A) and A' = A|CD /* A' = A|CD(D,W) */ 

2. if G(A') has dominating cycle C 

then for the canonical partition (C_1, C_2) of C 
return CD ∪ (core+(A'|C_1) ∩ core+(A'|C_2)) 
else return CD. 

We have the following result. 

Theorem 9. If A is a default theory, then $core(A) \subseteq core^+(A) \subseteq GD(D,E)$ 
for each extension E of A. 

For illustration, let us extend our running example (1) by defaults expressing 
that “birds that can be assumed to be flightless are ratites” and “flying birds and 
ratites are animals” (leaving facts $W_1 = \{\neg f \rightarrow ab_b,\; f \rightarrow ab_p,\; p\}$ unchanged): 

$\left\{ \frac{b : \neg ab_b}{f},\; \frac{b : w}{w},\; \frac{p : b}{b},\; \frac{p : \neg ab_p}{\neg f} \right\} \cup \left\{ \frac{b : \neg f}{r},\; \frac{(b \wedge f) \vee r : a}{a} \right\}$ 

where $\delta_r$ and $\delta_a$ abbreviate the new default rules. We get two extensions 
$E_1 = Th(W_1 \cup \{b, w, \neg f, r, a\})$ and $E_2 = Th(W_1 \cup \{b, w, f, a\})$. Note that $\delta_a$ 
contributes to both extensions. It is easy to verify that the default theory has 
the following block graph: 





Let us now step through procedure core+: As with our initial default theory, we 
obtain core rules $CD = \{\delta_b, \delta_w\}$, resulting in $A' = (\{\delta_f, \delta_{\neg f}, \delta_r, \delta_a\}, W_1 \cup \{b, w\})$. 
This yields the following reduced block graph G(A'): 




On Bottom-Up Pre-processing Techniques for Automated Default Reasoning 277 




We thus get an even cycle $C = \{\delta_f, \delta_{\neg f}\}$ that dominates rules $\delta_r$ and $\delta_a$, 
so that they are outside the scope of our pre-processing techniques. While $\delta_r$ 
is still blockable, $\delta_a$ is not grounded. Now, the canonical partition of C is 
$(\{\delta_{\neg f}\}, \{\delta_f\})$. From this, we get (i) $A'|C_1 = (\{\delta_r, \delta_a\}, W_1 \cup \{b, w\} \cup \{\neg f\})$ along 
with $core^+(A'|C_1) = \{\delta_r, \delta_a\}$ and (ii) $A'|C_2 = (\{\delta_a\}, W_1 \cup \{b, w\} \cup \{f\})$ along 
with $core^+(A'|C_2) = \{\delta_a\}$. Intersecting the results from both recursions gives 
$\{\delta_a\}$, so that our original call to core+ returns $\{\delta_b, \delta_w\} \cup \{\delta_a\}$. 
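
The combination step of core+ on this example can be traced with plain set operations. The sketch below is illustrative only: rule names are hypothetical string labels, and the two branch results are entered by hand from the worked example (the first one is inferred from extension E_1, which contains both r and a) rather than computed from a block graph.

```python
def canonical_partition(cycle):
    """Split an even cycle (v_1, ..., v_2n) into its odd-indexed and
    even-indexed nodes (1-based indices, as in Definition 8)."""
    if len(cycle) % 2 != 0:
        raise ValueError("canonical partition is only defined for even cycles")
    return set(cycle[0::2]), set(cycle[1::2])


part_1, part_2 = canonical_partition(["d_not_f", "d_f"])

core_rules = {"d_b", "d_w"}     # CD = core(A)
branch_1 = {"d_r", "d_a"}       # result of the recursion on A'|C_1
branch_2 = {"d_a"}              # result of the recursion on A'|C_2

# core+ keeps only what both branches agree on, added to the core rules.
core_plus = core_rules | (branch_1 & branch_2)
```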

Although this technique allows pre-processing beyond dominating even cy- 
cles, as exemplified by 5a, it should be applied with care since there may be an 
exponential number of such cycles in the worst case. 

5 Concluding remarks 

We have exploited the notion of a block graph for identifying default rules 
belonging to all sets of default rules generating different extensions. This has provided 
us with a technique that is applicable as a universal pre-processing tool to all 
sorts of computational tasks. Although this approach faces the same 
computational complexity as the computation of an extension, it should be seen as an 
investment that amortizes over subsequent computations. In the future, it will 
be interesting to see to what extent more general classes of even cycles or even odd 
loops can be broken up, so that one might eventually end up with a method for 
computing entire extensions. 
Abstract. In this paper, we focus on the combination of probabilistic 
logic programming with the principle of maximum entropy. We start by 
defining probabilistic queries to probabilistic logic programs and their 
answer substitutions under maximum entropy. We then present an ef- 
ficient linear programming characterization for the problem of deciding 
whether a probabilistic logic program is satisfiable. Finally, and as a 
central contribution of this paper, we introduce an efficient technique for 
approximative probabilistic logic programming under maximum entropy. 
This technique reduces the original entropy maximization task to solving 
a modified and relatively small optimization problem. 



1 Introduction 

Probabilistic propositional logics and their various dialects are thoroughly stud- 
ied in the literature (see especially [19] and [5]; see also [15] and [16]). Their ex- 
tensions to probabilistic first-order logics can be classified into first-order logics 
in which probabilities are defined over the domain and those in which probabil- 
ities are given over a set of possible worlds (see especially [2] and [9]). The first 
ones are suitable for describing statistical knowledge, while the latter are appro- 
priate for representing degrees of belief. The same classification holds for existing 
approaches to probabilistic logic programming: Ng [17] concentrates on probabil- 
ities over the domain. Subrahmanian and his group (see especially [18] and [4]) 
focus on annotation-based approaches to degrees of belief. Poole [22], Haddawy 
[8], and Jaeger [10] discuss approaches to degrees of belief close to Bayesian 
networks [21]. Finally, another approach to probabilistic logic programming with 
degrees of belief, which is especially directed towards efficient implementations, 
has recently been introduced in [14]. 

Usually, the available probabilistic knowledge does not suffice to specify com- 
pletely a distribution. In this case, applying the principle of maximum entropy is 
a well-appreciated means of probabilistic inference, both from a statistical and 
from a logical point of view. Entropy is an information-theoretical measure [26] 
reflecting the indeterminateness inherent to a distribution. Given some consist- 
ent probabilistic knowledge, the principle of maximum entropy chooses as the 
most appropriate representation the one distribution among all distributions 
satisfying that knowledge which has maximum entropy (ME). Within a rich 
statistical first-order language, Grove et al. [7] show that this ME-distribution 
may be taken to compute degrees of belief of formulas. Paris and Vencovska [20] 
investigate the foundations of consistent probabilistic inference and set up postu- 
lates that characterize ME-inference uniquely within that framework. A similar 
result was stated in [27], based on optimization theory. Jaynes [11] regarded 
the ME-principle as a special case of a more general principle for translating 
information into a probability assignment. Recently, the principle of maximum 
entropy has been proved to be the most appropriate principle for dealing with 
conditionals [13] (that is, using the notions of the present paper, ground 
probabilistic clauses of the form $(H|B)[c_1, c_2]$ with $c_1 = c_2$). 

The main idea of this paper is to combine probabilistic logic programming 
with the principle of maximum entropy. We thus follow an old idea already 
stated in the work by Nilsson [19], however, lifted to the first-order framework of 
probabilistic logic programs. At first sight, this project might seem an intractable 
task, since already probabilistic propositional logics under maximum entropy 
suffer from efficiency problems (which are due to an exponential number of 
possible worlds in the number of propositional variables). In this paper, however, 
we will see that this is not the case. More precisely, we will show that the efficient 
approach to probabilistic logic programming in [14], combined with new ideas, 
can be extended to an efficient approach to probabilistic logic programming 
under maximum entropy. Roughly speaking, the probabilistic logic programs 
presented in [14] generally carry an additional structure that can successfully be 
exploited in both classical probabilistic query processing and probabilistic query 
processing under maximum entropy. 

The main contributions of this paper can be summarized as follows: 

• We define probabilistic queries to probabilistic logic programs and their cor- 
rect and tight answer substitutions under maximum entropy. 

• We present an efficient linear programming characterization for the problem 
of deciding whether a probabilistic logic program is satisfiable. 

• We introduce an efficient technique for approximative probabilistic logic pro- 
gramming under maximum entropy. More precisely, this technique reduces 
the original entropy maximizations to relatively small optimization prob- 
lems, which can easily be solved by existing ME-technology. 

The rest of this paper is organized as follows. Section 2 introduces the tech- 
nical background. In Section 3, we give an example. Section 4 concentrates on 
deciding the satisfiability of probabilistic logic programs. In Section 5, we discuss 
probabilistic logic programming under maximum entropy itself. Section 6 finally 
summarizes the main results and gives an outlook on future research. 

All proofs are given in full detail in the appendix. 
2 Probabilistic Logic Programs and Maximum Entropy 

Let $\Phi$ be a first-order vocabulary that contains a finite and nonempty set of 
predicate symbols and a finite and nonempty set of constant symbols (that is, 
we do not consider function symbols in this framework). Let $X$ be a set of object 
variables and bound variables. Object variables represent elements of a certain 
domain, while bound variables describe real numbers in the interval [0, 1]. 

An object term is a constant symbol from $\Phi$ or an object variable from $X$. An 
atomic formula is an expression of the kind $p(t_1, \ldots, t_k)$ with a predicate symbol 
$p$ of arity $k \geq 0$ from $\Phi$ and object terms $t_1, \ldots, t_k$. A conjunctive formula is the 
false formula $\bot$, the true formula $\top$, or the conjunction $A_1 \wedge \cdots \wedge A_l$ of atomic 
formulas $A_1, \ldots, A_l$ with $l > 0$. A probabilistic clause is an expression of the 
form $(H|B)[c_1, c_2]$ with real numbers $c_1, c_2 \in [0,1]$ and conjunctive formulas $H$ 
and $B$ different from $\bot$. A probabilistic program clause is a probabilistic clause 
$(H|B)[c_1, c_2]$ with $c_1 \leq c_2$. We call $H$ its head and $B$ its body. A probabilistic 
logic program $\mathcal{P}$ is a finite set of probabilistic program clauses. 

Probabilistic program clauses can be classified into facts, rules, and 
constraints as follows: facts are probabilistic program clauses of the form $(H|\top)[c_1, c_2]$ 
with $c_2 > 0$, rules are of the form $(H|B)[c_1, c_2]$ with $B \neq \top$ and $c_2 > 0$, and 
constraints are of the kind $(H|B)[0,0]$. Probabilistic program clauses can also 
be divided into logical and purely probabilistic program clauses: logical program 
clauses are probabilistic program clauses of the kind $(H|B)[1,1]$ or $(H|B)[0,0]$, 
while purely probabilistic program clauses are of the form $(H|B)[c_1, c_2]$ with 
$c_1 < 1$ and $c_2 > 0$. We abbreviate the logical program clauses $(H|B)[1,1]$ and 
$(H|B)[0,0]$ by $H \leftarrow B$ and $\bot \leftarrow H \wedge B$, respectively. 

The semantics of probabilistic clauses is defined by a possible worlds 
semantics in which each possible world is identified with a Herbrand interpretation 
of the classical first-order language for $\Phi$ and $X$ (that is, with a subset of the 
Herbrand base over $\Phi$). Hence, the set of possible worlds $\mathcal{I}_\Phi$ is the set of all 
subsets of the Herbrand base $HB_\Phi$. A variable assignment $\sigma$ maps each object variable 
to an element of the Herbrand universe $HU_\Phi$ and each bound variable to a real 
number from [0, 1]. For Herbrand interpretations $I$, conjunctive formulas $C$, and 
variable assignments $\sigma$, we write $I \models_\sigma C$ to denote that $C$ is true in $I$ under $\sigma$. 

A probabilistic interpretation $Pr$ is a mapping from $\mathcal{I}_\Phi$ to [0, 1] such that all 
$Pr(I)$ with $I \in \mathcal{I}_\Phi$ sum up to 1. The probability of a conjunctive formula $C$ in the 
interpretation $Pr$ under a variable assignment $\sigma$, denoted $Pr_\sigma(C)$, is defined as 
follows (we write $Pr(C)$ if $C$ is variable-free): 

$$Pr_\sigma(C) = \sum_{I \in \mathcal{I}_\Phi,\; I \models_\sigma C} Pr(I).$$ 

A probabilistic clause $(H|B)[c_1, c_2]$ is true in the interpretation $Pr$ under a 
variable assignment $\sigma$, denoted $Pr \models_\sigma (H|B)[c_1, c_2]$, iff 
$c_1 \cdot Pr_\sigma(B) \leq Pr_\sigma(H \wedge B) \leq c_2 \cdot Pr_\sigma(B)$. A probabilistic clause 
$(H|B)[c_1, c_2]$ is true in $Pr$, denoted $Pr \models (H|B)[c_1, c_2]$, iff 
$Pr \models_\sigma (H|B)[c_1, c_2]$ for all variable assignments $\sigma$. 
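
For ground formulas these definitions can be made concrete in a few lines of Python. The sketch below is an illustration under simplifying assumptions: worlds are frozensets of ground atoms, a ground conjunction is the set of its atoms (the empty set standing for the true formula), and the two-atom Herbrand base and the numbers are made up for the example.

```python
from itertools import chain, combinations


def worlds(atoms):
    """All subsets of a (small) Herbrand base, as frozensets."""
    atoms = list(atoms)
    subsets = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1))
    return [frozenset(s) for s in subsets]


def prob(pr, conjunction):
    """Pr(C): total probability of the worlds in which every atom of C holds."""
    return sum(p for world, p in pr.items() if conjunction <= world)


def is_true(pr, head, body, c1, c2, eps=1e-12):
    """Check c1 * Pr(B) <= Pr(H and B) <= c2 * Pr(B), up to rounding."""
    pb = prob(pr, body)
    phb = prob(pr, head | body)
    return c1 * pb - eps <= phb <= c2 * pb + eps


# Hypothetical interpretation over the Herbrand base {p, q}.
pr = {w: 0.0 for w in worlds({"p", "q"})}
pr[frozenset({"p", "q"})] = 0.45
pr[frozenset({"p"})] = 0.05
pr[frozenset()] = 0.5

holds = is_true(pr, frozenset({"q"}), frozenset({"p"}), 0.9, 0.9)
```

Note that a clause is trivially true whenever its body has probability 0, since both bounds then collapse to 0.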

A probabilistic interpretation $Pr$ is called a model of a probabilistic clause $F$ 
iff $Pr \models F$. It is a model of a set of probabilistic clauses $\mathcal{F}$, denoted $Pr \models \mathcal{F}$, iff 
$Pr$ is a model of all probabilistic clauses in $\mathcal{F}$. A set of probabilistic clauses $\mathcal{F}$ 
is satisfiable iff a model of $\mathcal{F}$ exists. 

Object terms, conjunctive formulas, and probabilistic clauses are ground iff 
they do not contain any variables. The notions of substitutions, ground 
substitutions, and ground instances of probabilistic clauses are canonically defined. 
Given a probabilistic logic program $\mathcal{P}$, we use $ground(\mathcal{P})$ to denote the set of all 
ground instances of probabilistic program clauses in $\mathcal{P}$. Moreover, we identify 
$\Phi$ with the vocabulary of all predicate and constant symbols that occur in $\mathcal{P}$. 

The maximum entropy model (ME-model) of a satisfiable set of probabilistic 
clauses $\mathcal{F}$, denoted $ME[\mathcal{F}]$, is the unique probabilistic interpretation $Pr$ that is a 
model of $\mathcal{F}$ and that has the greatest entropy among all the models of $\mathcal{F}$, where 
the entropy of $Pr$, denoted $H(Pr)$, is defined by: 

$$H(Pr) = -\sum_{I \in \mathcal{I}_\Phi} Pr(I) \cdot \log Pr(I).$$ 
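
As a quick numerical illustration of the entropy measure, the following Python sketch evaluates H for two made-up distributions over four worlds. It assumes the natural logarithm and the usual convention that terms with probability 0 contribute nothing.

```python
import math


def entropy(probabilities):
    """H = -sum p * log p, skipping zero-probability worlds."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
```

Without further constraints the uniform distribution has the greatest entropy; probabilistic clauses restrict the admissible distributions, and the ME-model is the least committed one among those that remain.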

A probabilistic clause $F$ is a maximum entropy consequence (ME-consequence) 
of a set of probabilistic clauses $\mathcal{F}$, denoted $\mathcal{F} \models^\star F$, iff $ME[\mathcal{F}] \models F$. A 
probabilistic clause $(H|B)[c_1, c_2]$ is a tight maximum entropy consequence (tight 
ME-consequence) of a set of probabilistic clauses $\mathcal{F}$, denoted $\mathcal{F} \models^\star_{tight} (H|B)[c_1, c_2]$, 
iff $c_1$ is the minimum and $c_2$ is the maximum of all $ME_\sigma[\mathcal{F}](H \wedge B) \,/\, ME_\sigma[\mathcal{F}](B)$ 
with $ME_\sigma[\mathcal{F}](B) > 0$ and variable assignments $\sigma$. 

A probabilistic query is an expression of the form $\exists(H|B)[c_1, c_2]$ or of the form 
$\exists(H|B)[x_1, x_2]$ with real numbers $c_1, c_2 \in [0,1]$ such that $c_1 \leq c_2$, two different 
bound variables $x_1, x_2 \in X$, and conjunctive formulas $H$ and $B$ different from $\bot$. 
A probabilistic query $\exists(H|B)[t_1, t_2]$ is object-ground iff $H$ and $B$ are ground. 

Given a probabilistic query $\exists(H|B)[c_1, c_2]$ with $c_1, c_2 \in [0,1]$ to a 
probabilistic logic program $\mathcal{P}$, we are interested in its correct maximum entropy answer 
substitutions (correct ME-answer substitutions), which are substitutions $\theta$ such 
that $\mathcal{P} \models^\star (H\theta|B\theta)[c_1, c_2]$ and that $\theta$ acts only on variables in $\exists(H|B)[c_1, c_2]$. Its 
ME-answer is Yes if a correct ME-answer substitution exists and No otherwise. 
Whereas, given a probabilistic query $\exists(H|B)[x_1, x_2]$ with $x_1, x_2 \in X$ to a 
probabilistic logic program $\mathcal{P}$, we are interested in its tight maximum entropy answer 
substitutions (tight ME-answer substitutions), which are substitutions $\theta$ such 
that $\mathcal{P} \models^\star_{tight} (H\theta|B\theta)[x_1\theta, x_2\theta]$, that $\theta$ acts only on variables in $\exists(H|B)[x_1, x_2]$, 
and that $x_1\theta, x_2\theta \in [0,1]$. Note that for probabilistic queries $\exists(H|B)[x_1, x_2]$ with 
$x_1, x_2 \in X$, there always exist tight ME-answer substitutions (in particular, 
object-ground probabilistic queries $\exists(H|B)[x_1, x_2]$ with $x_1, x_2 \in X$ always have 
a unique tight ME-answer substitution). 

3 Example 

We give an example adapted from [14]. Let us assume that John wants to pick 
up Mary after she stopped working. To do so, he must drive from his home to 
her office. However, he left quite late. So, he is wondering if he can still reach her 
in time. Unfortunately, since it is rush hour, it is very probable that he runs into 
a traffic jam. Now, John has the following knowledge at hand: given a road (ro) 
in the south (so) of the town, he knows that the probability that he can reach 
(re) S through R without running into a traffic jam is 90% (1). A friend just 
called him and gave him advice (ad) about some roads without any significant 
traffic (2) . He also clearly knows that if he can reach S through T and T through 
R, both without running into a traffic jam, then he can also reach S through R 
without running into a traffic jam (3). This knowledge can be expressed by the 
following probabilistic rules (R, S, and T are object variables): 

$(re(R,S) \mid ro(R,S) \wedge so(R,S))[0.9, 0.9]$ (1) 

$(re(R,S) \mid ro(R,S) \wedge ad(R,S))[1, 1]$ (2) 

$(re(R,S) \mid re(R,T) \wedge re(T,S))[1, 1]$. (3) 

Some self-explaining probabilistic facts are given as follows (h, a, b, and o are 
constant symbols; the fourth clause describes the fact that John is not sure 
anymore whether or not his friend was talking about the road from a to b): 

$(ro(h,a) \mid \top)[1, 1]$, $(ad(h,a) \mid \top)[1, 1]$ 

$(ro(a,b) \mid \top)[1, 1]$, $(ad(a,b) \mid \top)[0.8, 0.8]$ 

$(ro(b,o) \mid \top)[1, 1]$, $(so(b,o) \mid \top)[1, 1]$. 

John is wondering whether he can reach Mary’s office from his home, such that 
the probability of him running into a traffic jam is smaller than 1%. This can 
be expressed by the probabilistic query $\exists(re(h,o))[.99, 1]$. His wondering about 
the probability of reaching the office, without running into a traffic jam, can be 
expressed by $\exists(re(h,o))[X_1, X_2]$, where $X_1$ and $X_2$ are bound variables. 

4 Satisfiability 

In this section, we concentrate on the problem of deciding whether a probabil- 
istic logic program is satisfiable. Note that while classical logic programs without 
negation and logical constraints (see especially [1]) are always satisfiable, prob- 
abilistic logic programs may become unsatisfiable, just for logical inconsistencies 
through logical constraints or, more generally, for probabilistic inconsistencies 
in the assumed probability ranges. 

4.1 Naive Linear Programming Characterization 

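The satisfiability of a probabilistic logic program $\mathcal{P}$ can be characterized in a 
straightforward way by the solvability of a system of linear constraints as follows. 
Let $\mathcal{LC}_\Phi$ be the least set of linear constraints over $y_I \geq 0$ ($I \in \mathcal{I}_\Phi$) containing: 

(1) $\sum_{I \in \mathcal{I}_\Phi} y_I = 1$ 

(2) $c_1 \cdot \sum_{I \in \mathcal{I}_\Phi,\, I \models B} y_I \;\leq\; \sum_{I \in \mathcal{I}_\Phi,\, I \models H \wedge B} y_I \;\leq\; c_2 \cdot \sum_{I \in \mathcal{I}_\Phi,\, I \models B} y_I$ 

for all $(H|B)[c_1, c_2] \in ground(\mathcal{P})$. 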
It is now easy to see that $\mathcal{P}$ is satisfiable iff $\mathcal{LC}_\Phi$ is solvable. The crux with this 
naive characterization is that the number of variables and of linear constraints 
is linear in the cardinality of $\mathcal{I}_\Phi$ and of $ground(\mathcal{P})$, respectively. Thus, especially 
the number of variables is generally quite large, as the following example shows. 

Example 4.1 Let us take the probabilistic logic program $\mathcal{P}$ that comprises 
all the probabilistic program clauses given in Section 3. If we characterize the 
satisfiability of $\mathcal{P}$ in the described naive way, then we get a system of linear 
constraints that has $2^{64} \approx 18 \cdot 10^{18}$ (!) variables and 205 linear constraints. 
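
The two numbers in Example 4.1 can be recomputed from the program of Section 3. The following Python sketch does the counting; it assumes that every ground instance contributes two linear inequalities (a lower and an upper bound) in addition to the single normalization constraint, which is how constraint (2) is counted here.

```python
# Vocabulary of the example: four binary predicates, four constants.
predicate_arities = {"ro": 2, "so": 2, "ad": 2, "re": 2}
constants = ["h", "a", "b", "o"]

herbrand_base_size = sum(len(constants) ** arity
                         for arity in predicate_arities.values())
num_variables = 2 ** herbrand_base_size  # one variable per possible world

# Number of object variables per clause: rules (1) and (2) have two,
# rule (3) has three, and the six facts are already ground.
object_variables_per_clause = [2, 2, 3] + [0] * 6
num_ground_instances = sum(len(constants) ** k
                           for k in object_variables_per_clause)

# One normalization constraint plus two inequalities per ground instance.
num_constraints = 1 + 2 * num_ground_instances
```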



4.2 Reduced Linear Programming Characterization 

We now present a new system of linear constraints to characterize the 
satisfiability of a probabilistic logic program $\mathcal{P}$. This new system generally has a much 
smaller size than $\mathcal{LC}_\Phi$. In detail, we combine some ideas from [14] with the idea of 
partitioning $ground(\mathcal{P})$ into active and inactive ground instances, which yields 
another substantial increase of efficiency. We need some preparations: 

Let $\overline{\mathcal{P}}$ denote the set of all logical program clauses in $\mathcal{P}$. Let $\hat{\mathcal{P}}$ denote the least 
set of logical program clauses that contains $H \leftarrow B$ if the program $\mathcal{P}$ contains a 
probabilistic program clause $(H|B)[c_1, c_2]$ with $c_2 > 0$. 

We define a mapping $R$ that maps each ground conjunctive formula $C$ to a 
subset of $HB_\Phi \cup \{\bot\}$ as follows. If $C = \bot$, then $R(C)$ is $HB_\Phi \cup \{\bot\}$. If $C \neq \bot$, 
then $R(C)$ is the set of all ground atomic formulas that occur in $C$. 

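For a set $\mathcal{L}$ of logical program clauses, we define the operator $T_{\mathcal{L}} \uparrow \omega$ on the 
set of all subsets of $HB_\Phi \cup \{\bot\}$ as usual. For this task, we need the immediate 
consequence operator $T_{\mathcal{L}}$, which is defined as follows. For all $I \subseteq HB_\Phi \cup \{\bot\}$: 

$$T_{\mathcal{L}}(I) = \bigcup \{R(H) \mid H \leftarrow B \in ground(\mathcal{L}) \text{ with } R(B) \subseteq I\}.$$ 

For all $I \subseteq HB_\Phi \cup \{\bot\}$, we define $T_{\mathcal{L}} \uparrow \omega(I)$ as the union of all $T_{\mathcal{L}} \uparrow n(I)$ with 
$n < \omega$, where $T_{\mathcal{L}} \uparrow 0(I) = I$ and $T_{\mathcal{L}} \uparrow (n+1)(I) = T_{\mathcal{L}}(T_{\mathcal{L}} \uparrow n(I))$ for all $n < \omega$. We 
adopt the usual convention to abbreviate $T_{\mathcal{L}} \uparrow \alpha(\emptyset)$ by $T_{\mathcal{L}} \uparrow \alpha$. 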
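
The fixpoint iteration starting from the empty set can be sketched in Python as follows. The sketch applies it to the logical approximation of the program from Section 3, that is, every clause is read as a plain rule H ← B. The encoding is an assumption of this sketch: atoms are tuples, upper-case strings are object variables, and the false formula is left out because the example contains no constraints.

```python
from itertools import product

CONSTANTS = ["h", "a", "b", "o"]

# (head, body) pairs; upper-case arguments are object variables.
RULES = [
    (("re", "R", "S"), [("ro", "R", "S"), ("so", "R", "S")]),
    (("re", "R", "S"), [("ro", "R", "S"), ("ad", "R", "S")]),
    (("re", "R", "S"), [("re", "R", "T"), ("re", "T", "S")]),
]
FACTS = [("ro", "h", "a"), ("ad", "h", "a"), ("ro", "a", "b"),
         ("ad", "a", "b"), ("ro", "b", "o"), ("so", "b", "o")]


def substitute(atom, binding):
    """Replace variables in an atom according to the binding."""
    return (atom[0],) + tuple(binding.get(t, t) for t in atom[1:])


def ground(rules, facts, constants):
    """All ground instances; facts become clauses with an empty body."""
    grounded = [(fact, []) for fact in facts]
    for head, body in rules:
        variables = sorted({t for atom in [head] + body
                            for t in atom[1:] if t.isupper()})
        for values in product(constants, repeat=len(variables)):
            binding = dict(zip(variables, values))
            grounded.append((substitute(head, binding),
                             [substitute(b, binding) for b in body]))
    return grounded


def immediate_consequences(clauses, interpretation):
    """Heads of all ground clauses whose body atoms are all in the interpretation."""
    return {head for head, body in clauses
            if all(b in interpretation for b in body)}


def least_fixpoint(clauses):
    """Iterate the operator from the empty set until nothing changes."""
    current = set()
    while True:
        following = immediate_consequences(clauses, current)
        if following == current:
            return current
        current = following


closure = least_fixpoint(ground(RULES, FACTS, CONSTANTS))
```

Starting from the empty set the iterates form an increasing chain, so the loop terminates as soon as two successive iterates coincide.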

The set $ground(\mathcal{P})$ is now partitioned into active and inactive ground 
instances as follows. A ground instance $(H|B)[c_1, c_2] \in ground(\mathcal{P})$ is active if 
$R(H) \cup R(B) \subseteq T_{\hat{\mathcal{P}}} \uparrow \omega$ and inactive otherwise. We use $active(\mathcal{P})$ to denote the 
set of all active ground instances of $ground(\mathcal{P})$. 

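We are now ready to define the index set $\mathcal{I}_{\mathcal{P}}$ of the variables in the new 
system of linear constraints. It is defined by $\mathcal{I}_{\mathcal{P}} = \mathcal{I}'_{\mathcal{P}} \cap \mathcal{I}_\Phi$, where $\mathcal{I}'_{\mathcal{P}}$ is the least 
set of subsets of $HB_\Phi \cup \{\bot\}$ with: 

($\alpha$) $T_{\overline{\mathcal{P}}} \uparrow \omega \in \mathcal{I}'_{\mathcal{P}}$, 

($\beta$) $T_{\overline{\mathcal{P}}} \uparrow \omega(R(B)),\; T_{\overline{\mathcal{P}}} \uparrow \omega(R(H) \cup R(B)) \in \mathcal{I}'_{\mathcal{P}}$ for all purely probabilistic program 
clauses $(H|B)[c_1, c_2] \in active(\mathcal{P})$, 

($\gamma$) $T_{\overline{\mathcal{P}}} \uparrow \omega(I_1 \cup I_2) \in \mathcal{I}'_{\mathcal{P}}$ for all $I_1, I_2 \in \mathcal{I}'_{\mathcal{P}}$. 

The index set $\mathcal{I}_{\mathcal{P}}$ just involves atomic formulas from $T_{\hat{\mathcal{P}}} \uparrow \omega$: 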
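Lemma 4.2 It holds $I \subseteq T_{\hat{\mathcal{P}}} \uparrow \omega$ for all $I \in \mathcal{I}_{\mathcal{P}}$. 

The new system of linear constraints $\mathcal{LC}_{\mathcal{P}}$ itself is defined as follows: $\mathcal{LC}_{\mathcal{P}}$ is the 
least set of linear constraints over $y_I \geq 0$ ($I \in \mathcal{I}_{\mathcal{P}}$) that contains: 

(1) $\sum_{I \in \mathcal{I}_{\mathcal{P}}} y_I = 1$ 

(2) $c_1 \cdot \sum_{I \in \mathcal{I}_{\mathcal{P}},\, I \models B} y_I \;\leq\; \sum_{I \in \mathcal{I}_{\mathcal{P}},\, I \models H \wedge B} y_I \;\leq\; c_2 \cdot \sum_{I \in \mathcal{I}_{\mathcal{P}},\, I \models B} y_I$ 

for all purely probabilistic program clauses $(H|B)[c_1, c_2] \in active(\mathcal{P})$. 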

We now roughly describe the ideas that carry us to the new linear constraints. 
The first idea is to just introduce a variable for each $I \subseteq T_{\hat{\mathcal{P}}} \uparrow \omega$ and not for each 
$I \subseteq HB_\Phi$ anymore. This also means to introduce a linear constraint only for 
each member of $active(\mathcal{P})$, and not for each member of $ground(\mathcal{P})$ anymore. The 
second idea is to exploit all logical program clauses in $\mathcal{P}$. That is, to just introduce 
a variable for each $T_{\overline{\mathcal{P}}} \uparrow \omega(I)$ with $I \subseteq T_{\hat{\mathcal{P}}} \uparrow \omega$. This also means to introduce a 
linear constraint only for each purely probabilistic member of $active(\mathcal{P})$. Finally, 
the third idea is to exploit the structure of all purely probabilistic members of 
$active(\mathcal{P})$. That is, to just introduce a variable for each $I \in \mathcal{I}_{\mathcal{P}}$. 

The following important theorem shows the correctness of these ideas. 

Theorem 4.3 $\mathcal{P}$ is satisfiable iff $\mathcal{LC}_{\mathcal{P}}$ is solvable. 

We give an example to illustrate the new system of linear constraints $\mathcal{LC}_{\mathcal{P}}$. 

Example 4.4 Let us take again the probabilistic logic program $\mathcal{P}$ that 
comprises all the probabilistic program clauses given in Section 3. The system $\mathcal{LC}_{\mathcal{P}}$ 
then consists of five linear constraints over four variables $y_i \geq 0$ ($i \in [0:3]$): 



$y_0 + y_1 + y_2 + y_3 = 1$ 

$0.9 \cdot (y_0 + y_1 + y_2 + y_3) \leq y_1 + y_3$ 

$0.9 \cdot (y_0 + y_1 + y_2 + y_3) \geq y_1 + y_3$ 

$0.8 \cdot (y_0 + y_1 + y_2 + y_3) \leq y_2 + y_3$ 

$0.8 \cdot (y_0 + y_1 + y_2 + y_3) \geq y_2 + y_3$ 

More precisely, the variables $y_i$ ($i \in [0:3]$) correspond as follows to the 
members of $\mathcal{I}_{\mathcal{P}}$ (written in binary as subsets of $T_{\hat{\mathcal{P}}} \uparrow \omega$ = {ro(h,a), ro(a,b), ro(b,o), 
ad(h,a), ad(a,b), so(b,o), re(h,a), re(a,b), re(b,o), re(h,b), re(a,o), re(h,o)}): 

$y_0$ = 111101100000, $y_1$ = 111101101000 

$y_2$ = 111111110100, $y_3$ = 111111111111. 

Moreover, the four linear inequalities correspond to the following two active 
ground instances of purely probabilistic program clauses in $\mathcal{P}$: 

$(re(b,o) \mid ro(b,o) \wedge so(b,o))[0.9, 0.9]$, $(ad(a,b) \mid \top)[0.8, 0.8]$. 
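
That the reduced system of Example 4.4 is solvable can be checked directly by exhibiting one solution. The Python sketch below tests a candidate assignment against the five constraints; the candidate values are an assumption chosen for illustration (they treat the two probabilistic clauses as independent) and are not claimed to be the solution used in the paper.

```python
def satisfies_example_4_4(y, tol=1e-9):
    """Check the five constraints of Example 4.4 for y = (y0, y1, y2, y3).
    Each pair of <= / >= inequalities amounts to one equality."""
    y0, y1, y2, y3 = y
    total = y0 + y1 + y2 + y3
    return (all(value >= 0 for value in y)
            and abs(total - 1.0) <= tol
            and abs((y1 + y3) - 0.9 * total) <= tol
            and abs((y2 + y3) - 0.8 * total) <= tol)


# Hypothetical candidate: 0.1*0.2, 0.9*0.2, 0.1*0.8, 0.9*0.8.
candidate = (0.02, 0.18, 0.08, 0.72)
is_solution = satisfies_example_4_4(candidate)
```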
5 Probabilistic Logic Programming under ME 

In this section, we concentrate on the problem of computing tight ME-answer 
substitutions for probabilistic queries to probabilistic logic programs. Since every 
general probabilistic query can be reduced to a finite number of object-ground 
probabilistic queries, we restrict our attention to object-ground queries. 

In the sequel, let $\mathcal{P}$ be a satisfiable probabilistic logic program and let 
$Q = \exists(G|A)[x_1, x_2]$ be an object-ground query with $x_1, x_2 \in X$. To provide the tight 
ME-answer substitution for $Q$, we now need $ME[\mathcal{P}](A)$ and $ME[\mathcal{P}](G \wedge A)$. 

5.1 Exact ME-Models 

The ME-model of $\mathcal{P}$ can be computed in a straightforward way by solving the 
following entropy maximization problem over the variables $y_I \geq 0$ ($I \in \mathcal{I}_\Phi$): 

$$\max\; -\sum_{I \in \mathcal{I}_\Phi} y_I \log y_I \quad \text{subject to } \mathcal{LC}_\Phi. \qquad (4)$$ 

However, as we already know from Section 4.1, the set of linear constraints $\mathcal{LC}_\Phi$ 
has a number of variables and of linear constraints that is linear in the cardinality 
of $\mathcal{I}_\Phi$ and of $ground(\mathcal{P})$, respectively. That is, especially the number of variables 
of the entropy maximization (4) is generally quite large. 

Example 5.1 Let us take again the probabilistic logic program $\mathcal{P}$ that 
comprises all the clauses given in Section 3. The entropy maximization (4) is done 
subject to a system of 205 linear constraints over $2^{64} \approx 18 \cdot 10^{18}$ variables. 

5.2 Approximative ME-Models 

We now introduce approximative ME-models, which are characterized by optim- 
ization problems that generally have a much smaller size than (4). 

Like the linear programs in Section 4.1, the optimization problems (4) suffer 
especially from a large number of variables. It is thus natural to wonder whether 
the reduction technique of Section 4.2 also applies to (4). 

This is indeed the case, if we make the following two assumptions: 

(1) All ground atomic formulas in $Q$ belong to $T_{\hat{\mathcal{P}}} \uparrow \omega$. 

(2) Instead of computing the ME-model of $\mathcal{P}$, we compute the ME-model of 
$active(\mathcal{P})$ (that is, we approximate $ME[\mathcal{P}]$ by $ME[active(\mathcal{P})]$). 

Note that $\hat{\mathcal{P}}$ can be considered a logical approximation of the probabilistic logic 
program $\mathcal{P}$. This logical approximation of $\mathcal{P}$ does not logically entail any other 
ground atomic formulas than those in $T_{\hat{\mathcal{P}}} \uparrow \omega$. Hence, both assumptions (1) and 
(2) are just small restrictions (from a logic programming point of view). 

To compute $ME[active(\mathcal{P})](A)$ and $ME[active(\mathcal{P})](G \wedge A)$, we now have to 
adapt the technical notions of Section 4.2 as follows. 

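The index set $\mathcal{I}_{\mathcal{P}}$ must be adapted by also incorporating the structure of the 
query $Q$ into its definition. More precisely, the new index set $\mathcal{I}_{\mathcal{P},Q}$ is defined by 
$\mathcal{I}_{\mathcal{P},Q} = \mathcal{I}'_{\mathcal{P},Q} \cap \mathcal{I}_\Phi$, where $\mathcal{I}'_{\mathcal{P},Q}$ is the least set of subsets of $HB_\Phi \cup \{\bot\}$ with: 

($\alpha$) $T_{\overline{\mathcal{P}}} \uparrow \omega,\; T_{\overline{\mathcal{P}}} \uparrow \omega(R(A)),\; T_{\overline{\mathcal{P}}} \uparrow \omega(R(G) \cup R(A)) \in \mathcal{I}'_{\mathcal{P},Q}$, 

($\beta$) $T_{\overline{\mathcal{P}}} \uparrow \omega(R(B)),\; T_{\overline{\mathcal{P}}} \uparrow \omega(R(H) \cup R(B)) \in \mathcal{I}'_{\mathcal{P},Q}$ for all purely probabilistic 
program clauses $(H|B)[c_1, c_2] \in active(\mathcal{P})$, 

($\gamma$) $T_{\overline{\mathcal{P}}} \uparrow \omega(I_1 \cup I_2) \in \mathcal{I}'_{\mathcal{P},Q}$ for all $I_1, I_2 \in \mathcal{I}'_{\mathcal{P},Q}$. 

Also the new index set $\mathcal{I}_{\mathcal{P},Q}$ just involves atomic formulas from $T_{\hat{\mathcal{P}}} \uparrow \omega$: 

Lemma 5.2 It holds $I \subseteq T_{\hat{\mathcal{P}}} \uparrow \omega$ for all $I \in \mathcal{I}_{\mathcal{P},Q}$. 

The system of linear constraints $\mathcal{LC}_{\mathcal{P}}$ must be adapted to $\mathcal{LC}_{\mathcal{P},Q}$, which is the 
least set of linear constraints over $y_I \geq 0$ ($I \in \mathcal{I}_{\mathcal{P},Q}$) that contains: 

(1) $\sum_{I \in \mathcal{I}_{\mathcal{P},Q}} y_I = 1$ 

(2) $c_1 \cdot \sum_{I \in \mathcal{I}_{\mathcal{P},Q},\, I \models B} y_I \;\leq\; \sum_{I \in \mathcal{I}_{\mathcal{P},Q},\, I \models H \wedge B} y_I \;\leq\; c_2 \cdot \sum_{I \in \mathcal{I}_{\mathcal{P},Q},\, I \models B} y_I$ 
for all purely probabilistic program clauses $(H|B)[c_1, c_2] \in active(\mathcal{P})$. 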

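Finally, we need the following definitions. Let $2_{\mathcal{P}} = \{L \mid L \subseteq T_{\hat{\mathcal{P}}} \uparrow \omega\}$ and let $a_I$ 
($I \in \mathcal{I}_{\mathcal{P},Q}$) be the number of all possible worlds $J \in T_{\overline{\mathcal{P}}} \uparrow \omega(2_{\mathcal{P}}) \cap 2_{\mathcal{P}}$ that are a 
superset of $I$ and that are not a superset of any $K \in \mathcal{I}_{\mathcal{P},Q}$ that properly includes $I$. 

Roughly speaking, $\mathcal{I}_{\mathcal{P},Q}$ defines a partition $\{S_I \mid I \in \mathcal{I}_{\mathcal{P},Q}\}$ of $T_{\overline{\mathcal{P}}} \uparrow \omega(2_{\mathcal{P}}) \cap 2_{\mathcal{P}}$ 
and each $a_I$ with $I \in \mathcal{I}_{\mathcal{P},Q}$ denotes the cardinality of $S_I$. Note especially that 
$a_I > 0$ for all $I \in \mathcal{I}_{\mathcal{P},Q}$, since $\mathcal{I}_{\mathcal{P},Q} \subseteq T_{\overline{\mathcal{P}}} \uparrow \omega(2_{\mathcal{P}}) \cap 2_{\mathcal{P}}$. 

We are now ready to characterize the ME-model of $active(\mathcal{P})$ by the optimal 
solution of a reduced optimization problem. 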

Theorem 5.3 For all ground conjunctive formulas C with Tpp uj(R(C)) 

ME[active{V)]{C) = E/eipo. 2// > 

where yf with I G g is the optimal solution of the following optimization prob- 
lem over the variables yi >0 with I GTp,g: 

max - E yi (log 2// - log a/) subject to CCp^q . (5) 

IEIv.q 

The tight ME-answer substitution for the probabilistic query Q to the ground
probabilistic logic program active(P) is more precisely given as follows.

Corollary 5.4 Let $y^*_I$ with $I \in I_{P,Q}$ be the optimal solution of (5).

(a) If $y^*_I = 0$ for all $I \in I_{P,Q}$ with $I \models A$, then the tight ME-answer substitution
for the query $\exists (G|A)[x_1, x_2]$ to active(P) is given by $\{x_1/1, x_2/0\}$.

(b) If $y^*_I > 0$ for some $I \in I_{P,Q}$ with $I \models A$, then the tight ME-answer substitution
for the query $\exists (G|A)[x_1, x_2]$ to active(P) is given by $\{x_1/d, x_2/d\}$, where

\[ d = \sum_{I \in I_{P,Q},\, I \models G \wedge A} y^*_I \;\Big/ \sum_{I \in I_{P,Q},\, I \models A} y^*_I . \]

We give an example to illustrate the optimization problem (5).

Example 5.5 Let us take again the probabilistic logic program P from Section 3.
The tight ME-answer substitution for the query $\exists (re(h,o))[X_1, X_2]$ to
active(P) is given by $\{X_1/0.9353, X_2/0.9353\}$, since ME[active(P)](re(h,o)) =
$y^*_3 + y^*_4 + y^*_5 + y^*_6 = 0.9353$, where $y^*_i$ ($i \in [0:6]$) is the optimal solution of the
following optimization problem over the variables $y_i \ge 0$ ($i \in [0:6]$):

\[ \max\; -\sum_{i=0}^{6} y_i \, (\log y_i - \log a_i) \quad \text{subject to } LC_{P,Q} , \]

where $(a_0, a_1, a_2, a_3, a_4, a_5, a_6)$ is given by (3, 1, 1, 1, 6, 5, 2) and $LC_{P,Q}$ consists
of the following five linear constraints over the seven variables $y_i \ge 0$ ($i \in [0:6]$):

\[
\begin{aligned}
y_0 + y_1 + y_2 + y_3 + y_4 + y_5 + y_6 &= 1 \\
0.9 \cdot (y_0 + y_1 + y_2 + y_3 + y_4 + y_5 + y_6) &\le y_1 + y_3 + y_5 \\
0.9 \cdot (y_0 + y_1 + y_2 + y_3 + y_4 + y_5 + y_6) &\ge y_1 + y_3 + y_5 \\
0.8 \cdot (y_0 + y_1 + y_2 + y_3 + y_4 + y_5 + y_6) &\le y_2 + y_3 + y_6 \\
0.8 \cdot (y_0 + y_1 + y_2 + y_3 + y_4 + y_5 + y_6) &\ge y_2 + y_3 + y_6
\end{aligned}
\]

More precisely, the variables $y_i$ ($i \in [0:6]$) correspond as follows to the members
of $I_{P,Q}$ (written in binary as subsets of $T_{\overline{P}}\uparrow\omega$ = {ro(h,a), ro(a,b), ro(b,o),
ad(h,a), ad(a,b), so(b,o), re(h,a), re(a,b), re(b,o), re(h,b), re(a,o), re(h,o)}):

y_0 = 111101100000, y_1 = 111101101000, y_2 = 111111110100, y_3 = 111111111111,
y_4 = 111101100001, y_5 = 111101101001, y_6 = 111111110101.

Furthermore, the variables $y_i$ ($i \in [0:6]$) correspond as follows to the members
of $T_P\uparrow\omega(I_{\overline{P}}) \cap I_{\overline{P}}$ (written in binary as subsets of $T_{\overline{P}}\uparrow\omega$). Note that $a_i$
with $i \in [0:6]$ is given by the number of members associated with $y_i$.

y_0 = (111101100000, 111101100100, 111101110100)

y_1 = (111101101000), y_2 = (111111110100), y_3 = (111111111111)

y_4 = (111101100001, 111101100011, 111101100101,
       111101100111, 111101110101, 111101110111)

y_5 = (111101101001, 111101101011, 111101101101,
       111101101111, 111101111111)

y_6 = (111111110101, 111111110111).

Finally, the four linear inequalities correspond to the following two active ground
instances of purely probabilistic program clauses in P:

(re(b,o) | ro(b,o) ∧ so(b,o))[0.9, 0.9], (ad(a,b) | ⊤)[0.8, 0.8].

Note that we used the ME-system shell SPIRIT (see especially [23] and [24])
to compute the ME-model of active(P).
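
The reduced problem of Example 5.5 is small enough to be checked numerically. The following is a minimal sketch, not the procedure used in the paper: it assumes that the two interval constraints collapse to equalities (since c1 = c2 in both clauses) and solves the problem by iterative proportional fitting, starting from the distribution proportional to the numbers a_i. The names `fit`, `targets`, `A`, and `B` are illustrative only.

```python
# Sketch under stated assumptions: iterative proportional fitting (IPF) for
# max -sum_i y_i (log y_i - log a_i) subject to the constraints of Example 5.5.
# IPF is a swapped-in technique, not the SPIRIT shell used in the paper.

a = [3, 1, 1, 1, 6, 5, 2]       # a_0..a_6: number of possible worlds behind each y_i
A = {1, 3, 5}                   # y_1 + y_3 + y_5 must carry probability 0.9
B = {2, 3, 6}                   # y_2 + y_3 + y_6 must carry probability 0.8
targets = [(A, 0.9), (B, 0.8)]

def fit(a, targets, sweeps=200):
    total = float(sum(a))
    y = [ai / total for ai in a]            # start proportional to a_i
    for _ in range(sweeps):
        for idx, c in targets:
            mass = sum(y[i] for i in idx)
            # rescale so that the constrained set gets mass c, the rest 1 - c
            y = [yi * (c / mass if i in idx else (1 - c) / (1 - mass))
                 for i, yi in enumerate(y)]
    return y

y = fit(a, targets)
me_value = y[3] + y[4] + y[5] + y[6]        # ME[active(P)](re(h,o))
print(round(me_value, 4))                   # 0.9353
```

Under these assumptions the computed value agrees with the figure 0.9353 reported in the example.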
5.3 Computing Approximative ME-Models

We now briefly discuss the problem of computing the numbers $a_I$ with $I \in I_{P,Q}$
and the problem of solving the optimization problem (5).

As far as the numbers $a_I$ are concerned, we just have to solve two linear
equations. For this purpose, we need the new index set defined by I^q =
n Zg, where is the least set of subsets of $HB_\Phi \cup \{\bot\}$ with:

(α) $,R{A),RiG)UR{A)

(β) R(B), R(H) ∪ R(B) G T^q for all $(H|B)[c_1, c_2] \in$ active(P),

(γ) Ji U 7a e 7 :" Q for all 7i , 7a G

We start by computing the numbers sj with .7 £ X^q, which are the unique
solution of the following system of linear equations:

SjgiJq.JC/ bl for all 7 G Xp Q .

We are now ready to compute the numbers aj with .7 G Xp^Q, which are the
unique solution of the following system of linear equations:

for all 7 G .

As far as the optimization problem (5) is concerned, we can build on existing
ME-technology. For example, the ME-system PIT (see [6] and [25]) solves entropy
maximization problems subject to indifferent possible worlds (which are
known to have the same probability in ME-models). It can thus directly be used
to solve the optimization problem (5).

Note also that if the probabilistic logic program P contains just probabilistic
program clauses of the form $(H|B)[c_1, c_2]$ with $c_1 = c_2$, then the optimization
problem (5) can easily be solved by standard Lagrangean techniques (as described
in [24] and [25] for entropy maximization).

6 Summary and Outlook 

In this paper, we discussed the combination of probabilistic logic programming 
with the principle of maximum entropy. We presented an efficient linear pro- 
gramming characterization for the problem of deciding whether a probabilistic 
logic program is satisfiable. Furthermore, we especially introduced an efficient
technique for approximative query processing under maximum entropy. 

A very interesting topic of future research is to analyze the relationship 
between the ideas of this paper and the characterization of the principle of 
maximum entropy in the framework of conditionals given in [13]. 

Appendix 

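Proof of Lemma 4.2: The claim is proved by induction on the definition of
$I'_P$ as follows. Let $I \in I'_P$ with $I \ne HB_\Phi \cup \{\bot\}$.

(α) If $I = T_P\uparrow\omega$, then $I \subseteq T_{\overline{P}}\uparrow\omega$, since $T_P\uparrow\omega \subseteq T_{\overline{P}}\uparrow\omega$.

(β) If $I = T_P\uparrow\omega(R(B))$ or $I = T_P\uparrow\omega(R(H) \cup R(B))$ for some purely probabilistic
$(H|B)[c_1, c_2] \in$ active(P), then $I \subseteq T_{\overline{P}}\uparrow\omega$, since $R(H) \cup R(B) \subseteq T_{\overline{P}}\uparrow\omega$.

(γ) If $I = T_P\uparrow\omega(I_1 \cup I_2)$ for some $I_1, I_2 \in I'_P$, then $I \subseteq T_{\overline{P}}\uparrow\omega$, since $I_1, I_2 \ne HB_\Phi \cup \{\bot\}$
and thus $I_1 \cup I_2 \subseteq T_{\overline{P}}\uparrow\omega$ by the induction hypothesis. □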

Proof of Theorem 4.3: We first need some preparation as follows. We show
that all purely probabilistic program clauses from active(P) can be interpreted
by probability functions over a partition $\{S_I \mid I \in I_P\}$ of $T_P\uparrow\omega(I_\Phi) \cap I_\Phi$. That
is, as far as active(P) is concerned, we do not need the fine granulation of $I_\Phi$.

For all $I \in I_P$ let $S_I$ be the set of all possible worlds $J \in T_P\uparrow\omega(I_\Phi) \cap I_\Phi$
that are a superset of I and that are not a superset of any $K \in I_P$ that properly
includes I. We now show that $\{S_I \mid I \in I_P\}$ is a partition of $T_P\uparrow\omega(I_\Phi) \cap I_\Phi$. Assume
first that there are two different $I_1, I_2 \in I_P$ and some $J \in T_P\uparrow\omega(I_\Phi) \cap I_\Phi$
with $J \in S_{I_1} \cap S_{I_2}$. Then $J \supseteq I_1 \cup I_2$ and thus $J \supseteq T_P\uparrow\omega(I_1 \cup I_2)$. Moreover, it
holds $T_P\uparrow\omega(I_1 \cup I_2) \in I_P$ by (γ) and $T_P\uparrow\omega(I_1 \cup I_2) \supset I_1$ or $T_P\uparrow\omega(I_1 \cup I_2) \supset I_2$.
But this contradicts the assumption $J \in S_{I_1} \cap S_{I_2}$. Assume next that there are
some $J \in T_P\uparrow\omega(I_\Phi) \cap I_\Phi$ that do not belong to $\bigcup \{S_I \mid I \in I_P\}$. We now construct
an infinite chain $I_0 \subset I_1 \subset \cdots$ of elements of $I_P$ as follows. Let us define
$I_0 = T_P\uparrow\omega$. It then holds $I_0 \in I_P$ by (α) and also $J \supseteq I_0$. But, since $J \notin S_{I_0}$,
there must be some $I_1 \in I_P$ with $J \supseteq I_1$ and $I_1 \supset I_0$. This argumentation can
now be continued in an infinite way. However, the number of subsets of $HB_\Phi$ is
finite and we have thus arrived at a contradiction.

We next show that for all $I \in I_P$, all possible worlds $J \in T_P\uparrow\omega(I_\Phi) \cap I_\Phi$,
and all ground conjunctive formulas C with $T_P\uparrow\omega(R(C)) \in I_P$, it holds $J \models C$
for some $J \in S_I$ iff $J \models C$ for all $J \in S_I$. Let $J \models C$ for some $J \in S_I$. It then holds
$J \supseteq I$, $J \supseteq R(C)$, and thus $J \supseteq T_P\uparrow\omega(R(C))$. We now show that $I \supseteq T_P\uparrow\omega(R(C))$.
Assume first $I \subset T_P\uparrow\omega(R(C))$. But this contradicts $J \in S_I$. Suppose next that
$I \not\supseteq T_P\uparrow\omega(R(C))$ and $I \not\subset T_P\uparrow\omega(R(C))$. Since $J \supseteq I \cup T_P\uparrow\omega(R(C))$, we get
$J \supseteq T_P\uparrow\omega(I \cup T_P\uparrow\omega(R(C)))$. Moreover, it holds $T_P\uparrow\omega(I \cup T_P\uparrow\omega(R(C))) \in I_P$
by (γ) and $T_P\uparrow\omega(I \cup T_P\uparrow\omega(R(C))) \supset I$. But this contradicts $J \in S_I$. Hence, we
get $I \supseteq T_P\uparrow\omega(R(C))$. Since $J \supseteq I$ for all $J \in S_I$, we thus get $J \supseteq R(C)$ for all
$J \in S_I$. That is, $J \models C$ for all $J \in S_I$. The converse trivially holds.

We are now ready to prove the theorem as follows. Let Pr be a model of P.
Let $y_I$ ($I \in I_P$) be defined as the sum of all Pr(J) with $J \in S_I$. It is now easy
to see that $y_I$ ($I \in I_P$) is a solution of $LC_P$.

Conversely, let $y_I$ ($I \in I_P$) be a solution of $LC_P$. Let the probabilistic
interpretation Pr be defined by $Pr(I) = y_I$ if $I \in I_P$ and $Pr(I) = 0$ otherwise.
It is easy to see that Pr is a model of all logical program clauses in P and of
all purely probabilistic program clauses in active(P). Let us now take a purely
probabilistic program clause $(H|B)[c_1, c_2]$ from ground(P) \ active(P). Assume
that R(B) contains some $B_i \notin T_{\overline{P}}\uparrow\omega$. By Lemma 4.2, we then get $B_i \notin I$ for
all $I \in I_P$. Hence, it holds Pr(B) = 0 and thus $Pr \models (H|B)[c_1, c_2]$. Suppose
now that $R(B) \subseteq T_{\overline{P}}\uparrow\omega$ and that R(H) contains some $H_i \notin T_{\overline{P}}\uparrow\omega$. But this
contradicts the assumption $c_2 > 0$. That is, Pr is a model of P. □

Proof of Lemma 5.2: The claim can be proved like Lemma 4.2 (by induction
on the definition of $I'_{P,Q}$). The proof makes use of $R(G) \cup R(A) \subseteq T_{\overline{P}}\uparrow\omega$. □

Proof of Theorem 5.3: Since active(P) does not involve any other atomic
formulas than those in $T_{\overline{P}}\uparrow\omega$, we can restrict our attention to probability
functions over the set of possible worlds $I_{\overline{P}} = \{L \mid L \subseteq T_{\overline{P}}\uparrow\omega\}$. Like in the proof
of Theorem 4.3, we need some preparations as follows. We show that all purely
probabilistic program clauses from active(P) can be interpreted by probability
functions over a partition $\{S_I \mid I \in I_{P,Q}\}$ of $T_P\uparrow\omega(I_{\overline{P}}) \cap I_{\overline{P}}$.

For all $I \in I_{P,Q}$ let $S_I$ be the set of all possible worlds $J \in T_P\uparrow\omega(I_{\overline{P}}) \cap I_{\overline{P}}$
that are a superset of I and that are not a superset of any $K \in I_{P,Q}$ that properly
includes I. By an argumentation like in the proof of Theorem 4.3, it can easily
be shown that $\{S_I \mid I \in I_{P,Q}\}$ is a partition of $T_P\uparrow\omega(I_{\overline{P}}) \cap I_{\overline{P}}$ and that for all
$I \in I_{P,Q}$, all possible worlds $J \in T_P\uparrow\omega(I_{\overline{P}}) \cap I_{\overline{P}}$, and all ground conjunctive
formulas C with $T_P\uparrow\omega(R(C)) \in I_{P,Q}$, it holds $J \models C$ for some $J \in S_I$ iff $J \models C$
for all $J \in S_I$.

Given a model Pr of active(P), we can thus define a model Pr* of active(P)
by $Pr^*(L) = 1/a_I \cdot \sum_{J \in S_I} Pr(J)$ if $L \in T_P\uparrow\omega(I_{\overline{P}}) \cap I_{\overline{P}}$, where $I \in I_{P,Q}$ such that
$L \in S_I$, and $Pr^*(L) = 0$ otherwise. Hence, it immediately follows that for all
$I \in I_{P,Q}$ and all $J_1, J_2 \in S_I$: ME[active(P)]($J_1$) = ME[active(P)]($J_2$).

Hence, for all ground conjunctive formulas C with $T_P\uparrow\omega(R(C)) \in I_{P,Q}$:

\[ \mathrm{ME}[\mathit{active}(P)](C) = \sum_{I \in I_{P,Q},\, I \models C} a_I \, x^*_I , \]

where $x^*_I$ ($I \in I_{P,Q}$) is the optimal solution of the following optimization problem
over the variables $x_I \ge 0$ ($I \in I_{P,Q}$):

\[ \max\; -\sum_{I \in I_{P,Q}} a_I \, x_I \log x_I \quad \text{subject to } LC'_{P,Q} , \]

where $LC'_{P,Q}$ is the least set of constraints over $x_I \ge 0$ ($I \in I_{P,Q}$) containing:

(1) $\sum_{I \in I_{P,Q}} a_I x_I = 1$

(2) $c_1 \cdot \sum_{I \in I_{P,Q},\, I \models B} a_I x_I \;\le\; \sum_{I \in I_{P,Q},\, I \models H \wedge B} a_I x_I \;\le\; c_2 \cdot \sum_{I \in I_{P,Q},\, I \models B} a_I x_I$
for all purely probabilistic program clauses $(H|B)[c_1, c_2] \in$ active(P).

Thus, we finally just have to perform the variable substitution $x_I = y_I / a_I$. □
Abstract. The efficiency of algorithms for probabilistic inference in
Bayesian networks can be improved by exploiting independence of causal 
influence. In this paper we propose a method to exploit independence of 
causal influence based on on-line construction of decomposition trees. 
The efficiency of inference is improved by exploiting independence rela- 
tions induced by evidence during decomposition tree construction. We 
also show how a factorized representation of independence of causal influ- 
ence can be derived from a local expression language. The factorized rep- 
resentation is shown to fit ideally with the lazy propagation framework. 
Empirical results indicate that considerable efficiency improvements can 
be expected if either the decomposition trees are constructed on-line or 
the factorized representation is used. 



1 Introduction 

Bayesian networks are an increasingly popular knowledge representation
framework for reasoning under uncertainty. A Bayesian network consists of a directed,
acyclic graph and a set of conditional probability distributions. For each variable 
in the graph there is a conditional probability distribution of the variable given 
its parents. The most common task performed on a Bayesian network is calcula- 
tion of the posterior marginal distribution for all non-evidence variables given a 
set of evidence. The complexity of probabilistic inference in Bayesian networks 
is, however, known to be NP-hard in general [2]. In this paper we consider how
a special kind of structure within the conditional probability distributions can 
be exploited to reduce the complexity of inference in Bayesian networks. The 
structure we want to exploit is present when the parents of a common child 
interact on the child independently. This is referred to as independence of causal 
influence (ICI), see e.g. [4-6, 11, 12, 14].

There exist basically two different approaches to inference in Bayesian networks.
One approach is based on building a junction tree representation T of the
Bayesian network Q and then propagating evidence in T by message passing. T
is constructed based on an off-line determined elimination order of the variables
in Q. The other approach is based on direct computation. Direct computation is
based on determining the elimination order or order of pairwise combination of
potentials on-line. Q is pruned to remove all variables not relevant for the query
of interest and an elimination order of the variables or a combination order of
the potentials of the pruned graph is determined.

Lazy propagation is a recently proposed junction tree based inference architecture [8].
A junction tree representation T of Q is constructed based on moralization
and triangulation of Q. The nodes of T are cliques (maximal, complete
subgraphs) of the triangulation of Q. The distributions of Q are associated with 
cliques of T such that the domain of a distribution is a subset of the clique 
with which it is associated. In the lazy propagation architecture distributions 
associated with a clique C are not combined to form the initial clique potential. 
Instead, a multiplicative decomposition of each clique and separator potential is 
maintained and potentials are only combined during inference when necessary. 

Inference in T is based on message passing between adjacent cliques of T. Each
message consists of a set of potentials with domains that are subsets of the
separator between the two adjacent cliques. A message from $C_i$ to $C_j$ over S is
computed from a subset, $\Phi_S$, of the potentials associated with $C_i$. $\Phi_S$ consists of
the potentials relevant for calculating the joint of S. All variables in domains of
potentials in $\Phi_S$, but not in S, are eliminated one at a time by marginalization.
The factorized decomposition of clique and separator potentials and the use of
direct computation to calculate messages make it possible to take advantage of
barren variables and independence relations induced by evidence during inference.
The efficiency of the lazy propagation algorithm can also be improved by
exploiting ICI.
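
The message computation described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: potentials are assumed to be pairs of a variable tuple and a table, the elimination order is simply alphabetical, and the selection of relevant potentials is left to the caller. The names `multiply`, `sum_out`, and `message` are hypothetical.

```python
from itertools import product

def multiply(p1, p2, card):
    """Pointwise product of two potentials given as (variables, table)."""
    vars1, t1 = p1
    vars2, t2 = p2
    out_vars = tuple(dict.fromkeys(vars1 + vars2))   # union, order preserved
    table = {}
    for assign in product(*[range(card[v]) for v in out_vars]):
        env = dict(zip(out_vars, assign))
        table[assign] = (t1[tuple(env[v] for v in vars1)]
                         * t2[tuple(env[v] for v in vars2)])
    return out_vars, table

def sum_out(p, var):
    """Marginalize one variable out of a potential."""
    vars_, t = p
    k = vars_.index(var)
    table = {}
    for assign, val in t.items():
        key = assign[:k] + assign[k + 1:]
        table[key] = table.get(key, 0.0) + val
    return vars_[:k] + vars_[k + 1:], table

def message(potentials, separator, card):
    """Eliminate every variable outside the separator, one at a time.
    Only potentials mentioning the eliminated variable are combined;
    all others stay untouched, so the message remains a set of potentials."""
    pots = list(potentials)
    to_eliminate = {v for vars_, _ in pots for v in vars_} - set(separator)
    for var in sorted(to_eliminate):                  # assumed order
        involved = [p for p in pots if var in p[0]]
        rest = [p for p in pots if var not in p[0]]
        combined = involved[0]
        for p in involved[1:]:
            combined = multiply(combined, p, card)
        pots = rest + [sum_out(combined, var)]
    return pots

# Hypothetical clique content: P(A), P(B | A) and a potential over B alone.
card = {"A": 2, "B": 2}
p_a = (("A",), {(0,): 0.3, (1,): 0.7})
p_b_given_a = (("B", "A"), {(0, 0): 0.8, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.9})
phi_b = (("B",), {(0,): 1.0, (1,): 0.5})
msg = message([p_a, p_b_given_a, phi_b], separator=("B",), card=card)
```

In this toy case the message contains two potentials over B: the untouched `phi_b` and the result of eliminating A from the product of the other two.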

Definition 1. The parent cause variables $C_1, \ldots, C_n$ of a common child effect
variable E are causally independent wrt. E if there exists a set of contribution
variables $E^{C_1}, \ldots, E^{C_n}$ with the same domain as E such that:

- for all $i, j \in \{1, \ldots, n\}$ with $i \ne j$: $E^{C_i} \perp C_j \mid C_i$ and $E^{C_i} \perp E^{C_j} \mid C_i$, and

- $E = E^{C_1} * \cdots * E^{C_n}$ where * is a commutative and associative binary operator.

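With definition 1, $P(E \mid C_1, \ldots, C_n)$ can be expressed as a sum over the product
of a set of usually much simpler probability distributions:

\[ P(E \mid pa(E)) = \sum_{\{E^{C_1}, \ldots, E^{C_n} \,\mid\, E = E^{C_1} * \cdots * E^{C_n}\}} \; \prod_{i=1}^{n} P(E^{C_i} \mid C_i) . \]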

For each cause variable in an ICI model some state is designated to be distinguished.
For most real-world models this will be the state bearing no effect on
E [6]. In general, ICI is exploited when cause variables are eliminated before the
effect variable E, as eliminating E first will reconstruct the entire conditional
probability distribution $P(E \mid C_1, \ldots, C_n)$.
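
As an illustration of the sum in definition 1, the sketch below enumerates all configurations of the contribution variables and accumulates the product of their probabilities under the combined value. It is a brute-force check under assumed parameters, not an efficient inference method; the function name `ici_distribution` and the noisy-OR parameters `q` are hypothetical.

```python
from itertools import product

def ici_distribution(contrib, causes, combine, states):
    """P(E | causes), summing over all contribution-variable configurations.
    contrib[i][c][x] is P(E^{C_i} = x | C_i = c)."""
    dist = {e: 0.0 for e in states}
    for xs in product(states, repeat=len(causes)):
        p = 1.0
        for table, c, x in zip(contrib, causes, xs):
            p *= table[c][x]
        e = xs[0]
        for x in xs[1:]:
            e = combine(e, x)            # E = E^{C_1} * ... * E^{C_n}
        dist[e] += p
    return dist

# Noisy-OR as an ICI model: * is logical OR, and a cause in its
# distinguished state (False) contributes False with probability 1.
q = [0.8, 0.6, 0.3]                      # hypothetical parameters
contrib = [{False: {False: 1.0, True: 0.0},
            True: {False: 1.0 - qi, True: qi}} for qi in q]

dist = ici_distribution(contrib, (True, True, False),
                        lambda a, b: a or b, (False, True))
# dist[True] equals 1 - (1 - 0.8) * (1 - 0.6) up to rounding
```

The enumeration is exponential in the number of causes, which is exactly the cost that parent divorcing and the factorized representation are meant to avoid.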



2 Lazy Parent Divorcing 

Parent divorcing [9] is a method to exploit ICI based on changing the structure
of the Bayesian network before the junction tree representation is constructed.
Let Q be a Bayesian network and let M be an ICI model in Q with variables E,
$C_1, \ldots, C_n$ and conditional probability distribution $P_M = P(E \mid C_1, \ldots, C_n)$.
With parent divorcing a number of variables, say $Y_1, \ldots, Y_m$, are introduced
as intermediate variables between $C_1, \ldots, C_n$ and E. The variables $Y_1, \ldots, Y_m$
are introduced to cache the intermediate results of combining cause variables
pairwise. The cause, effect, and intermediate variables are arranged in a
decomposition tree constructed once and off-line (in the case of n = 4 and
computation order $E = ((C_1 * C_2) * (C_3 * C_4))$):

\[
P(E \mid C_1, C_2, C_3, C_4) = \sum_{\{Y_1, Y_2 \mid E = Y_1 * Y_2\}}
\Bigl[ \sum_{\{E^{C_1}, E^{C_2} \mid Y_1 = E^{C_1} * E^{C_2}\}} P(E^{C_1} \mid C_1)\, P(E^{C_2} \mid C_2) \Bigr]
\Bigl[ \sum_{\{E^{C_3}, E^{C_4} \mid Y_2 = E^{C_3} * E^{C_4}\}} P(E^{C_3} \mid C_3)\, P(E^{C_4} \mid C_4) \Bigr] .
\]

Knowledge about the possible structure of the evidence, say $\varepsilon$, and the impact
of the structure of the decomposition tree on the remaining part of Q should be
considered when the tree is constructed. There exist, however, examples where it is
impossible to construct the tree off-line such that the space requirements are
minimal given different $\varepsilon$.




[Figure 1: (A) the Bayesian network Q; (B) the junction tree T, with potentials
$P(E \mid C_1, C_2, C_3)$, $P(A \mid C_1, C_2)$, $P(B \mid C_1, C_3)$, $P(C \mid C_2, C_3)$, $P(C_1)$, $P(C_2)$, $P(C_3)$.]

Fig. 1: A Bayesian network Q and corresponding junction tree T.

Example 1. Consider the Bayesian network Q shown in figure 1(A). Let M be
the ICI model describing the relationship between E and its parents. Three
different decomposition trees $T_A$, $T_B$, and $T_C$ can be constructed using parent
divorcing on M, see figure 2.

By considering three different evidence scenarios $\varepsilon_1$, $\varepsilon_2$, and $\varepsilon_3$ it will become
clear that it is impossible off-line to construct a decomposition tree with minimal
space requirements for all three scenarios. That is, for each of $T_A$, $T_B$, and $T_C$
there exists a set of evidence that makes one of the other trees a better choice.
As an example figure 3 shows a junction tree T corresponding to constructing
$T_B$ off-line. If $\varepsilon_1 = \{A = a, B = b\}$, then the potential with largest domain
size is constructed when a message is passed from the root to $EC_1Y$. The set
of potentials associated with the root of T after it has received messages from
all adjacent cliques is $\Phi = \{P(a \mid C_1, C_2), P(C_1), P(C_2), P(C_3), P(b \mid C_1, C_3),
P(Y \mid C_2, C_3)\}$. Variables $C_2$ and $C_3$ have to be eliminated, but neither $C_2$ nor
$C_3$ can be eliminated without creating a potential over variables: Y, $C_1$, $C_2$, $C_3$.
If either $T_A$ or $T_C$ is constructed, then the largest domain size is three instead
of four variables. On the other hand, if $\varepsilon_2 = \{B = b, C = c\}$, then $T_B$ and $T_C$
are optimal. Similarly, if $\varepsilon_3 = \{A = a, C = c\}$, then $T_A$ and $T_B$ are optimal.




Fig. 2: The three possible ways to divorce $C_1$, $C_2$, and $C_3$.



Example 1 shows that no matter what decomposition tree is constructed off-line 
there exists a set of evidence such that one of the other trees is a better choice. 

We propose a method based on performing the decomposition on-line and only
when needed. The method is referred to as lazy parent divorcing. Instead of
constructing a specific decomposition tree for M off-line, different decomposition
trees are constructed on-line. First, a junction tree representation T of
Q is constructed without recognizing M. Next, the distributions of Q are associated
with cliques of T as usual except that only a specification of $P_M$ similar
to definition 1 is used.

Probabilistic inference in a junction tree based inference architecture consists of
two phases. The first phase is message passing between adjacent cliques and the
second phase is calculation of posterior marginals. In either phase an ICI model
can be relevant for a calculation. If $\Phi$ is the set of potentials relevant for some
calculation and $P_M \in \Phi$ represents an ICI model M, then a decomposition tree
for M is constructed on-line before the calculation is performed.

[Figure 3: junction tree with potentials $P(Y \mid C_2, C_3)$, $P(A \mid C_1, C_2)$, $P(B \mid C_1, C_3)$,
$P(C \mid C_2, C_3)$, $P(E \mid C_1, Y)$, $P(C_1)$, $P(C_2)$, $P(C_3)$.]

Fig. 3: A junction tree for the Bayesian network shown in figure 2(B).

Example 2. Continuing example 1 consider the junction tree T shown in
figure 1(B). Assume $\varepsilon_1 = \{A = a, B = b\}$. The set of potentials associated
with the root of T after it has received messages from all adjacent cliques is
$\Phi = \{P(a \mid C_1, C_2), P(b \mid C_1, C_3), P(C_1), P(C_2), P(C_3), P(E \mid C_1, C_2, C_3)\}$.
$P(E \mid C_1, C_2, C_3)$ is not stored explicitly, only a specification similar to
definition 1 is stored. The potential with the largest domain size is constructed when
the posterior marginal P(E) is calculated. Based on the first five potentials of
$\Phi$ and the structure of $\varepsilon_1$ either $T_A$ or $T_C$ is constructed. Either of $T_A$ or $T_C$
gives a largest potential domain size of three variables. The decomposition tree
$T_A$, for instance, corresponds to the following decomposition:

\[
P(E \mid C_1, C_2, C_3) = \sum_{\{Y, E^{C_3} \mid E = Y * E^{C_3}\}} P(E^{C_3} \mid C_3)
\sum_{\{E^{C_1}, E^{C_2} \mid Y = E^{C_1} * E^{C_2}\}} P(E^{C_1} \mid C_1)\, P(E^{C_2} \mid C_2) .
\]

The evidence scenarios $\varepsilon_2$ and $\varepsilon_3$ are handled similarly.

The problem of finding an optimal decomposition tree for M is expected to be
NP-hard. The heuristic used in the current implementation is based on a score
s of making two cause variables $C_i$ and $C_j$ parents of a common intermediate
child variable. The score is equal to the sum of the weights of the fill-ins needed
to complete the neighborhood of $C_i$ and $C_j$ in the domain graph of $\Phi$:

\[ s(C_i, C_j) = \sum_{\{V, W \in adj(C_i) \cup adj(C_j) \setminus \{C_i, C_j\},\; V \notin adj(W)\}} |V| \cdot |W| . \]
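
The score can be computed directly from an adjacency structure. The sketch below is one possible reading of the formula, assuming that each missing edge is counted once (as an unordered pair) and that |V| denotes the number of states of V; the graph, the state counts, and the name `divorce_score` are made up for illustration.

```python
from itertools import combinations

def divorce_score(adj, n_states, ci, cj):
    """Sum of |V| * |W| over all non-adjacent pairs V, W in the joint
    neighbourhood of ci and cj (the fill-ins needed to complete it)."""
    neighbours = (adj[ci] | adj[cj]) - {ci, cj}
    return sum(n_states[v] * n_states[w]
               for v, w in combinations(sorted(neighbours), 2)
               if w not in adj[v])

# Hypothetical domain graph (symmetric adjacency sets) and state counts.
adj = {
    "C1": {"A"},
    "C2": {"A", "C3"},
    "C3": {"B", "C2"},
    "A": {"C1", "C2"},
    "B": {"C3"},
}
n_states = {"C1": 3, "C2": 3, "C3": 3, "A": 2, "B": 2}

scores = {pair: divorce_score(adj, n_states, *pair)
          for pair in combinations(["C1", "C2", "C3"], 2)}
best_pair = min(scores, key=scores.get)   # pair with the cheapest fill-ins
```

With these made-up numbers the pair (C2, C3) has the lowest score and would be given a common intermediate child first.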
It is clear that for a given ICI model the decomposition tree constructed on-line
is one of the decomposition trees that could have been constructed off-line. As
shown above it is, however, not always possible to construct a decomposition tree
off-line such that it is optimal for all possible sets of evidence. Even if it is
possible to construct such a tree off-line, it is not easy to determine its structure
off-line. In the case of on-line construction of decomposition trees all the relevant
information is represented in the set of potentials relevant for a calculation.

A new decomposition tree T is constructed each time an ICI model M is found
relevant for a calculation. Note that it might be unnecessary to construct T at
all. This happens when the effect variable E of M is marginalized out and none
of its descendants have received evidence, when certain kinds of evidence on E
are present, or when none of the variables in M are relevant for the calculation.



3 The Factorized Representation 

The main idea of the factorized representation proposed by [13] is to introduce
a set of intermediate variables representing $P(E \le e)$ for each state e of E
except one. The intermediate variable $E^{\le e}$ corresponds to the product of the
contributions from the cause variables to $E \le e$. This is used to calculate $P(E = e_i)$
as $P(E = e_i) = P(E \le e_i) - P(E \le e_{i-1})$ where $e_{i-1} < e_i$ are states of E.



3.1 Factorized Representation of Noisy-OR 

Consider a noisy-OR interaction model M consisting of binary variables E, $C_1$,
and $C_2$. We will use M to describe how the factorized representation can be
derived from the local expression language introduced by [3].

A local expression describing the dependence of E on $C_1$ and $C_2$ can be specified
as below (subscripts denote the domain on which the expression is defined):

\[
\begin{aligned}
exp(E \mid C_1, C_2) = {} & 1_{E=t \mid C_1, C_2} - (1_{E=t \mid C_1} - q_{E=t \mid C_1})(1_{E=t \mid C_2} - q_{E=t \mid C_2}) \\
& + (1_{E=f \mid C_1} - q_{E=f \mid C_1})(1_{E=f \mid C_2} - q_{E=f \mid C_2}),
\end{aligned}
\]

where $q_{E=t \mid C_1}$ and $q_{E=f \mid C_1}$ both represent $P(E = t \mid C_1 = t, C_2 = f)$ and
$P(E = t \mid C_1 = f, C_2 = f)$, but defined on different domains. As M is a noisy-OR
interaction model $P(E = t \mid C_1 = f, C_2 = f) = 0$.

The expression $exp(E \mid C_1, C_2)$ has the disadvantage that it is only available
to the expression for the entire network $exp(E, C_1, C_2)$ if we distribute over
$exp(E \mid C_1, C_2)$. This operation is, in general, exponential in the number of causes
of M and is therefore an expensive operation. By introducing an intermediate
variable $E^{\le f}$ with two states v and I as an intermediate variable between $C_1$,
$C_2$ and E it is possible to represent M with two expressions:




Lazy Propagation and Independence of Causal Influence 299 



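\[
\begin{aligned}
exp(E^{\le f} \mid C_1, C_2) &= 1_{E^{\le f}=I \mid C_1, C_2} + (1_{E^{\le f}=v \mid C_1} - q_{E^{\le f}=v \mid C_1})(1_{E^{\le f}=v \mid C_2} - q_{E^{\le f}=v \mid C_2}), \\
exp(E \mid E^{\le f}) &= 1_{E=t \mid E^{\le f}=I} - 1_{E=t \mid E^{\le f}=v} + 1_{E=f \mid E^{\le f}=v} .
\end{aligned}
\]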

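The expression $exp(E \mid C_1, C_2)$ can be reconstructed by combining the expressions
above and marginalizing $E^{\le f}$ out:

\[
\begin{aligned}
exp(E \mid C_1, C_2) &= \sum_{E^{\le f}} exp(E \mid E^{\le f}) * exp(E^{\le f} \mid C_1, C_2) \\
&= \sum_{E^{\le f}} (1_{E=t \mid E^{\le f}=I} - 1_{E=t \mid E^{\le f}=v} + 1_{E=f \mid E^{\le f}=v}) \\
&\qquad * \bigl(1_{E^{\le f}=I \mid C_1, C_2} + (1_{E^{\le f}=v \mid C_1} - q_{E^{\le f}=v \mid C_1}) * (1_{E^{\le f}=v \mid C_2} - q_{E^{\le f}=v \mid C_2})\bigr) \\
&= \sum_{E^{\le f}} 1_{E=t, E^{\le f}=I \mid C_1, C_2} \\
&\qquad - (1_{E=t, E^{\le f}=v \mid C_1} - q_{E=t, E^{\le f}=v \mid C_1}) * (1_{E=t, E^{\le f}=v \mid C_2} - q_{E=t, E^{\le f}=v \mid C_2}) \\
&\qquad + (1_{E=f, E^{\le f}=v \mid C_1} - q_{E=f, E^{\le f}=v \mid C_1}) * (1_{E=f, E^{\le f}=v \mid C_2} - q_{E=f, E^{\le f}=v \mid C_2}) \\
&= 1_{E=t \mid C_1, C_2} - (1_{E=t \mid C_1} - q_{E=t \mid C_1})(1_{E=t \mid C_2} - q_{E=t \mid C_2}) \\
&\qquad + (1_{E=f \mid C_1} - q_{E=f \mid C_1})(1_{E=f \mid C_2} - q_{E=f \mid C_2}) .
\end{aligned}
\]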

Based on the above derivation of $exp(E^{\le f} \mid C_1, C_2)$ and $exp(E \mid E^{\le f})$ potentials
$H(E \mid E^{\le f})$, $G(E^{\le f} \mid C_1)$, and $G(E^{\le f} \mid C_2)$ are constructed, see table 1. Each
$G(E^{\le f} \mid C_i)$ describes the contribution from $C_i$ to $E \le f$. The potential
$H(E \mid E^{\le f})$ describes how to calculate the distribution of E from the distributions of
the intermediate variables. $P(E \mid C_1, C_2)$ can be obtained by marginalizing out
$E^{\le f}$ of the product of H and the G's:

\[ P(E \mid C_1, C_2) = \sum_{E^{\le f}} H(E \mid E^{\le f}) \prod_{i=1}^{2} G(E^{\le f} \mid C_i) . \tag{1} \]

Table 1. Potentials for $G(E^{\le f} \mid C_1)$ (A), $G(E^{\le f} \mid C_2)$ (B), and $H(E \mid E^{\le f})$ (C).

(A)  C_1 | E^{≤f} = v                  | E^{≤f} = I
     f   | 1                           | 1
     t   | 1 - q_{E=t|C_1=t,C_2=f}     | 1

(B)  C_2 | E^{≤f} = v                  | E^{≤f} = I
     f   | 1                           | 1
     t   | 1 - q_{E=t|C_1=f,C_2=t}     | 1

(C)  E^{≤f} | E = f | E = t
     v      |  1    | -1
     I      |  0    |  1


Note that equation 1 only uses marginalization of variables and multiplication of 
potentials, no special operator is required. Therefore, no constraints are imposed 
on the elimination order or the order in which potentials are combined. 
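
Equation 1 can be checked numerically against the usual closed form of noisy-OR. The sketch below encodes the G and H potentials of table 1 and compares the marginalized product with 1 minus the product of the inhibitor probabilities; the parameter values in `q` and the function names are assumptions made for the illustration.

```python
from itertools import product

# Hypothetical parameters q_{E=t | C_i=t, other cause = f}
q = {"C1": 0.8, "C2": 0.6}

def G(s, c, qi):
    """G(E^{<=f} = s | C_i = c), table 1 (A) and (B)."""
    return 1.0 - qi if (s == "v" and c == "t") else 1.0

# H[(e, s)] = H(E = e | E^{<=f} = s), table 1 (C); note the negative entry.
H = {("f", "v"): 1.0, ("t", "v"): -1.0, ("f", "I"): 0.0, ("t", "I"): 1.0}

def factorized(e, c1, c2):
    """Right-hand side of equation 1: sum out the intermediate variable."""
    return sum(H[(e, s)] * G(s, c1, q["C1"]) * G(s, c2, q["C2"])
               for s in ("v", "I"))

def noisy_or(e, c1, c2):
    """Closed form: E is false only if every active cause is inhibited."""
    p_false = 1.0
    for c, name in ((c1, "C1"), (c2, "C2")):
        if c == "t":
            p_false *= 1.0 - q[name]
    return p_false if e == "f" else 1.0 - p_false

max_diff = max(abs(factorized(e, c1, c2) - noisy_or(e, c1, c2))
               for e, c1, c2 in product("ft", repeat=3))
```

Only multiplication and marginalization are used in `factorized`, in line with the remark that no special operator is required.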

The H and G potentials do not have the usual properties of probability poten- 
tials. The H potential includes negative numbers and for some parent configu- 
rations the distribution of E consists of all zeros. Both the H and G potentials 
have the property that for a given parent configuration the entries in the po- 
tential do not necessarily sum to one. This implies that marginalizing a head 
variable out of one of these potentials does not result in a unity potential. 



3.2 Factorized Representation of Noisy-MAX 

The factorized representation of noisy-OR interaction models can be generalized
to capture noisy-MAX interaction models. Let M be a noisy-MAX interaction
model with variables E, $C_1, C_2, \ldots, C_n$. One intermediate variable $E^{\le e}$ is introduced
for each state e of E except the highest state. If E has m + 1 states, then
a graphical interpretation of the factorized representation of M can be depicted
as shown in figure 4.

Fig. 4: A graphical interpretation of the factorized representation of M.



The complexity of ICI model representation is reduced from exponential in n to
exponential in |E|. As |E| is usually much smaller than n considerable efficiency
improvements can be expected.

Example 3. Let M be a noisy-MAX interaction model where each variable has
ordered domain l < m < h. The potentials required to represent M in factorized
form are shown in table 2 (where $q_{E>e \mid C_i=c}$ is shorthand for $q_{E>e \mid C_i=c,\, \forall j \ne i:\, C_j=l}$).
The potential $P(E \mid C_1, \ldots, C_n)$ can be obtained by eliminating $E^{\le l}$ and $E^{\le m}$
from the factorized representation of M:

\[ P(E \mid pa(E)) = \sum_{E^{\le l},\, E^{\le m}} H(E \mid E^{\le l}, E^{\le m}) \prod_{C \in pa(E)} G(E^{\le l} \mid C)\, G(E^{\le m} \mid C) . \]

Table 2. Potentials for $G(E^{\le l} \mid C)$ (A), $G(E^{\le m} \mid C)$ (B), and $H(E \mid E^{\le l}, E^{\le m})$ (C).

(A)  C | E^{≤l} = v         | E^{≤l} = I
     l | 1 - q_{E>l|C=l}    | 1
     m | 1 - q_{E>l|C=m}    | 1
     h | 1 - q_{E>l|C=h}    | 1

(B)  C | E^{≤m} = v         | E^{≤m} = I
     l | 1 - q_{E>m|C=l}    | 1
     m | 1 - q_{E>m|C=m}    | 1
     h | 1 - q_{E>m|C=h}    | 1

(C)  E^{≤l} | E^{≤m} | E = l | E = m | E = h
     v      | v      |  0    |  0    |  0
     v      | I      |  1    | -1    |  0
     I      | v      |  0    |  1    | -1
     I      | I      |  0    |  0    |  1
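
The potentials of table 2 can be exercised in the same way as in the noisy-OR case. The following sketch assumes, for brevity, that all causes share the same hypothetical parameters `q_gt[e][c]` (standing for q_{E>e|C=c}) and compares the factorized sum with the differences of cumulative probabilities P(E ≤ e); none of the names or numbers come from the paper.

```python
from itertools import product

STATES = ("l", "m", "h")                        # ordered domain l < m < h
# Hypothetical q_{E>e | C=c}; a cause in state l is assumed to have no effect.
q_gt = {"l": {"l": 0.0, "m": 0.7, "h": 0.9},
        "m": {"l": 0.0, "m": 0.2, "h": 0.6}}

def G(e, s, c):
    """G(E^{<=e} = s | C = c), table 2 (A) for e = l and (B) for e = m."""
    return 1.0 - q_gt[e][c] if s == "v" else 1.0

# H[(s_l, s_m)] = values for E = l, m, h, table 2 (C)
H = {("v", "v"): (0, 0, 0), ("v", "I"): (1, -1, 0),
     ("I", "v"): (0, 1, -1), ("I", "I"): (0, 0, 1)}

def factorized(causes):
    """Sum out both intermediate variables from the product of H and the G's."""
    dist = [0.0, 0.0, 0.0]
    for s_l, s_m in product("vI", repeat=2):
        weight = 1.0
        for c in causes:
            weight *= G("l", s_l, c) * G("m", s_m, c)
        for k in range(3):
            dist[k] += H[(s_l, s_m)][k] * weight
    return dist

def cumulative(causes):
    """P(E=l), P(E=m), P(E=h) from P(E<=l) and P(E<=m)."""
    f_l = f_m = 1.0
    for c in causes:
        f_l *= 1.0 - q_gt["l"][c]
        f_m *= 1.0 - q_gt["m"][c]
    return [f_l, f_m - f_l, 1.0 - f_m]

max_diff = max(abs(a - b)
               for causes in product(STATES, repeat=3)
               for a, b in zip(factorized(causes), cumulative(causes)))
```

The loop over the intermediate states has 2^(|E|-1) terms regardless of the number of causes, which illustrates the reduction from exponential in n to exponential in |E|.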



4 Empirical Results 

A series of tests were performed in order to determine the efficiency of exploiting 
ICI with lazy parent divorcing and factorized representation during inference in 
the lazy propagation architecture. The efficiency of the approaches is compared 
with the parent divorcing and temporal transformation [6] approaches. 

The tests were performed on six different versions of the CPCS-network [10] with 
40, 145, 245, 360, 364, and 422 nodes, respectively. All the ICI models in the 
networks are noisy-MAX interaction models. For each network evidence sets with 
sizes varying from 0 to 25 instantiated variables are chosen randomly. For each 
size, 25 different sets were generated. The average size (in terms of numbers) 
of the largest potential calculated during inference in the 145 and 245-nodes 
networks is shown in figure 5. See [7] for additional test results.

The test results indicate that lazy parent divorcing and the factorized represen- 
tation offer considerable improvements in space efficiency compared to parent 
divorcing and temporal transformation. The efficiency of lazy parent divorcing 
and the factorized representation seems to be of the same level. The factorized 
representation is, however, more efficient than lazy parent divorcing for small
sets of evidence in the 422-nodes network.

[Figure 5: panel (A) 145-nodes network; panel (B) 245-nodes network.]

Fig. 5. Plots of the test results for the 145 and 245-nodes subnetworks.



5 Discussion 

The lazy parent divorcing and factorized representation approaches to exploit- 
ing ICI fit naturally into the lazy propagation framework. Lazy parent divorcing 
postpones the construction of decomposition trees in the same way as lazy prop- 
agation postpones the combination of potentials. The factorized representation 
offers a multiplicative decomposition of ICI model distributions similar to the 
multiplicative decomposition of clique and separator potentials already exploited 
in the lazy propagation architecture. 

Both lazy parent divorcing and the factorized representation introduce a number 
of intermediate variables to reduce the complexity of inference. The intermediate 
variables are not part of the original Bayesian network and therefore not part of 
the junction tree constructed from the original Bayesian network. This raises the 
issue of when to eliminate intermediate variables most efficiently. The heuristic 
currently used in the implementation is to eliminate an intermediate variable V 
when the number of fill-ins added by eliminating V is two or less. 

Lazy propagation exploits a decomposition of the clique and separator potentials
while the factorized representation and lazy parent divorcing offer a decomposition
of conditional probability distributions of the effect variable in an independence
of causal influence model. The decomposition of conditional probability
distributions is only exploited if some cause variables are eliminated before the 
intermediate variables as eliminating all the intermediate variables before any 
cause variable will reconstruct the conditional probability distribution of the 
effect variable given the cause variables. 

We have described how the factorized representation of noisy-OR interaction 
models can be generalized to noisy-MAX interaction models. The factorized 
representation cannot, however, be generalized to represent all ICI models efficiently.
An in-depth analysis of the limitations of the factorized representation
is given in [13]. Examples of the models efficiently representable with the factorized
representation are noisy-OR, noisy-MAX, noisy-MIN, and noisy-AND.
Noisy-ADD is an example of an ICI model not efficiently representable using the
ideas of factorized representation [13].

Lazy parent divorcing does not, on the other hand, have any limitations with
respect to the representation of general ICI models. The ideas of lazy parent
divorcing also seem applicable to the construction of decomposition trees when the
structure of the conditional probability distributions is such that the independence
relations only hold in certain contexts. This is known as context-specific
independence, see for instance [1]. How to use the ideas of lazy parent divorcing
to exploit context-specific independence is a topic of future research.

In [15] it is reported that on average 7Mb of memory is required to propagate
randomly chosen evidence in the 245-node network using the heterogeneous
factorization to exploit ICI. In this network, the average size of the largest
potential calculated using either lazy parent divorcing or the factorized
representation is 5K. It has not been possible to compare lazy parent divorcing and the
factorized representation with the approach suggested by [11] as they do not
offer any empirical results.

For a given ICI model M the corresponding factorized representation can be 
constructed off-line. The lazy parent divorcing approach, on the other hand, 
requires additional time on-line to construct the decomposition tree for M during 
inference. The impact of the factorized representation and lazy parent divorcing 
on the time efficiency of inference is a topic of future research. 



6 Conclusion 

In this paper two different approaches to exploiting ICI to improve the efficiency 
of probabilistic inference in Bayesian networks are presented. Both the factorized 
representation and lazy parent divorcing fit nicely into the lazy propagation 
framework. Lazy parent divorcing is a method to exploit ICI based on on-line 
construction of decomposition trees. This makes it possible to exploit structure 
of the evidence to improve the efficiency of decomposition tree construction. The 
lazy propagation architecture is based on a multiplicative decomposition of clique 
and separator potentials and it is readily extended to exploit the multiplicative 
decomposition offered by the factorized representation. 

The reader should note that neither lazy parent divorcing nor the factorized 
representation imposes any constraints on the order in which variables can be 
eliminated or the order in which potentials can be combined. 

The empirical results indicate that considerable improvements in efficiency of 
probabilistic inference can be expected by the use of either lazy parent divorcing 
or the factorized representation to exploit ICI. 

Abstract. In this paper, a new Monte Carlo algorithm for combining
Dempster-Shafer belief functions is introduced. It is based on the idea
of approximate pre-computation, which makes it possible to obtain more
accurate estimates by carrying out a compilation phase prior to
the simulation. Some versions of the new algorithm are experimentally
compared to the previous methods.



1 Introduction 

From a computational point of view, one important problem of the Dempster-Shafer
theory of evidence [2, 10] is the complexity of computing the combination
of independent pieces of evidence by means of Dempster’s rule. It is known that
the exact computation of the combination is an NP-hard problem [7].

This fact motivates the development of approximate schemes for performing 
Dempster’s rule. Moral and Wilson have studied a wide class of approximate 
algorithms based on Monte Carlo methods. Those algorithms can be classified 
into two groups: Markov Chain Monte Carlo algorithms [5] and Importance 
Sampling algorithms [6]. In this paper we propose a method that belongs to
the importance sampling group. The new feature we introduce is the approximate 
pre-computation phase, earlier utilized in probabilistic reasoning [3,4]. 

The paper starts with some basic concepts and notation in Sec. 2. Known 
algorithms are reviewed in Sec. 3, and the fundamentals of importance sampling 
in Evidence Theory are given in Sec. 3.3. The new scheme we have developed is 
presented in Section 4. Section 5 is devoted to the description of the experiments 
carried out to check the performance of the new algorithms, and the paper ends 
with the conclusions in Sec. 6. 

* This work has been supported by CICYT under projects TIC97-1135-C04-01 and 
TIC97-1135-C04-02. 

2 Basic Concepts and Notation

Let $\Theta$ be a finite sample space. By $2^\Theta$ we denote the set of all the subsets of $\Theta$.
A mass function over $\Theta$ is a function $m : 2^\Theta \to [0, 1]$ verifying that $m(\emptyset) = 0$
and $\sum_{A \subseteq \Theta} m(A) = 1$. Those sets for which $m$ is strictly positive are called focal
sets. A function $\mathrm{Bel} : 2^\Theta \to [0, 1]$ is said to be a belief function over $\Theta$ if there
exists a mass function $m$ over $\Theta$ verifying $\mathrm{Bel}(X) = \sum_{A \subseteq X} m(A)$.

This definition establishes a one-to-one correspondence between mass functions
and belief functions [10]. Other functions can be defined from a mass function,
namely the plausibility and the commonality. A plausibility function over $\Theta$ is
a function $\mathrm{Pl} : 2^\Theta \to [0, 1]$ defined as $\mathrm{Pl}(X) = \sum_{A \cap X \neq \emptyset} m(A)$ for $X \in 2^\Theta$. A
commonality function over $\Theta$ is a mapping $Q : 2^\Theta \to [0, 1]$ defined as
$Q(X) = \sum_{A \supseteq X} m(A)$ for all $X \in 2^\Theta$.

Any of the functions defined above contains the same information, and it is a 
matter of convenience which one to use. However, mass functions may be more 
compact if only the masses of the focal sets are given. 
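
The three functions can be computed directly from the list of focal sets. The following is a minimal sketch, assuming a mass function is stored as a Python dictionary from `frozenset` focal sets to masses; the names `bel`, `pl` and `commonality` and the toy mass function are illustrative choices, not part of the original text.

```python
def bel(m, x):
    """Belief of x: total mass of the focal sets contained in x."""
    return sum(v for a, v in m.items() if a <= x)


def pl(m, x):
    """Plausibility of x: total mass of the focal sets that intersect x."""
    return sum(v for a, v in m.items() if a & x)


def commonality(m, x):
    """Commonality of x: total mass of the focal sets that contain x."""
    return sum(v for a, v in m.items() if a >= x)


# Illustrative mass function over Theta = {a, b, c}; only focal sets are stored.
theta = frozenset("abc")
m = {frozenset("a"): 0.5, frozenset("ab"): 0.3, frozenset("abc"): 0.2}
x = frozenset("ab")
print(bel(m, x), pl(m, x), commonality(m, x))
```

Storing only the focal sets is what makes the mass representation compact: each function is a single pass over the dictionary.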

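Now assume we have $n$ mass functions $m_1, \ldots, m_n$ over $\Theta$. The information
they contain can be collected in a new mass function $m = m_1 \otimes \cdots \otimes m_n$, where
$\otimes$ denotes the combination by Dempster’s rule:

$$m(X) = k^{-1} \sum_{\substack{A_1, \ldots, A_n \subseteq \Theta \\ A_1 \cap \cdots \cap A_n = X}} m_1(A_1) \cdots m_n(A_n), \quad \forall X \subseteq \Theta, \qquad (1)$$

where $k = \sum_{A_1 \cap \cdots \cap A_n \neq \emptyset} m_1(A_1) \cdots m_n(A_n)$.

The combined measure of belief over $\Theta$ is given by $\mathrm{Bel}(X) = \sum_{A \subseteq X} m(A)$.

The aim of this paper will be the estimation of $\mathrm{Bel}(X)$ for a given $X \subseteq \Theta$.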
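
For small inputs, Dempster’s rule can be evaluated exactly. The sketch below assumes the same dictionary representation of mass functions (`frozenset` to mass); the function names are hypothetical. It combines the masses pairwise, which gives the same result as the $n$-ary formula because Dempster’s rule is associative.

```python
from functools import reduce
from itertools import product


def combine(m1, m2):
    """Dempster's rule for two mass functions given as {frozenset: mass}."""
    raw = {}
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        c = a & b
        if c:  # keep only non-empty intersections
            raw[c] = raw.get(c, 0.0) + va * vb
    k = sum(raw.values())  # normalization constant (1 - conflict)
    if k == 0.0:
        raise ValueError("total conflict: the combination is undefined")
    return {c: v / k for c, v in raw.items()}


def combine_all(masses):
    """Combine a non-empty list of mass functions pairwise."""
    return reduce(combine, masses)


m1 = {frozenset("a"): 0.6, frozenset("ab"): 0.4}
m2 = {frozenset("b"): 0.5, frozenset("ab"): 0.5}
m = combine_all([m1, m2])
print(m)  # masses 3/7, 2/7 and 2/7 on {a}, {b} and {a, b}
```

The number of focal sets of the result can grow multiplicatively with each combination, which is the practical motivation for the sampling schemes discussed next.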

Next, we show how the calculation of $\mathrm{Bel}(X)$ can be formulated as the
estimation of a probability value by sampling from a multivariate probability
distribution.

We shall denote by $P_i$ a probability distribution over $2^\Theta$ defined as $P_i(A) = m_i(A)$
for all $A \in 2^\Theta$. An ordered collection of sets $(C_1, \ldots, C_n)$, $C_i \in 2^\Theta$, will be
called a configuration, and denoted by the letter $C$. Let $\Omega'$ be the set of all possible
$n$-dimensional configurations, and $\Omega = \{(C_1, \ldots, C_n) \in \Omega' \mid \bigcap_{i=1}^n C_i \neq \emptyset\}$.

From now on, the intersection of all the sets in a configuration $C$, $\bigcap_{i=1}^n C_i$, will
be abbreviated by $I_C$.

The following distribution can be defined over $\Omega'$: $P'(C) = \prod_{i=1}^n P_i(C_i)$ for
all $C \in \Omega'$.

Let $P_{DS}$ be a probability distribution over $\Omega$ such that for all $C \in \Omega$,
$P_{DS}(C) = P'(C)/P'(\Omega)$. From this notation, it follows that, for all $X \subseteq \Theta$,

$$m(X) = \sum_{I_C = X} P_{DS}(C), \qquad \mathrm{Bel}(X) = \sum_{I_C \subseteq X} P_{DS}(C).$$

So $\mathrm{Bel}(X)$ can be estimated by sampling configurations $C$ with probability
distribution $P_{DS}$ and calculating the relative frequency with which $I_C \subseteq X$. The
main problem of Monte Carlo algorithms is to sample from this distribution $P_{DS}$.
$P'$ is simple to manage and simulate, because it is the product of $n$ probability
distributions, but $P_{DS}$ imposes a global restriction by conditioning on $\Omega$, so that
the selections of the different components are not independent.

3 Previous Simulation Algorithms 

3.1 The Simple Monte Carlo Algorithm (MC) 

One simple way of obtaining configurations $C$ with probability $P_{DS}$ was proposed
by Wilson [11] and consists of selecting sets $C_i \in 2^\Theta$, $i = 1, \ldots, n$, with
probability $P_i(C_i)$ until we get a configuration of sets $C = (C_1, \ldots, C_n)$ such that
$I_C \neq \emptyset$. In this way, we obtain a configuration $C$ with probability proportional
to $P_{DS}(C)$.

The main drawback of this method is that it may be very difficult to get a
sequence with nonempty intersection. More precisely, the probability of obtaining
such a sequence is $P'(\Omega)$. The value $1 - P'(\Omega)$ is known as the conflict among
the evidences. So, we can say that this algorithm does not perform well if the
conflict is high (i.e. if $P'(\Omega)$ is close to 0). See [6, 11] for more details.
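
The simple Monte Carlo scheme amounts to rejection sampling. The following sketch assumes mass functions stored as dictionaries from `frozenset` focal sets to masses; the function names, the `max_tries` safeguard and the toy masses are illustrative assumptions rather than part of the original algorithm description.

```python
import random


def sample_focal(m, rng):
    """Draw one focal set of m with probability equal to its mass."""
    sets = list(m)
    return rng.choices(sets, weights=[m[a] for a in sets])[0]


def simple_mc_bel(masses, x, n_samples, rng, max_tries=10_000_000):
    """Estimate Bel(x) by rejection: keep only configurations in Omega."""
    hits = accepted = tries = 0
    while accepted < n_samples:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("conflict too high for rejection sampling")
        inter = None
        for m in masses:
            c = sample_focal(m, rng)
            inter = c if inter is None else inter & c
            if not inter:
                break  # empty intersection: reject this configuration
        if inter:
            accepted += 1
            hits += inter <= x
    return hits / n_samples


m1 = {frozenset("a"): 0.6, frozenset("ab"): 0.4}
m2 = {frozenset("b"): 0.5, frozenset("ab"): 0.5}
estimate = simple_mc_bel([m1, m2], frozenset("a"), 20_000, random.Random(0))
print(estimate)  # the exact value for this toy example is 3/7
```

The expected number of trials per accepted configuration is $1/P'(\Omega)$, which is why the method degrades as the conflict grows.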

3.2 The Markov Chain Monte Carlo Algorithm (MCMC) 

Unlike the simple Monte Carlo algorithm, in this case the trials are not independent.
Instead they form a Markov chain. This means that the result of each trial depends
only on the result of the previous one.

Moral and Wilson [5,6] designed a method for generating a sample from
$P_{DS}$ that starts with an initial configuration $C^0 = (C_1^0, \ldots, C_n^0)$. Then, the next
element in the sample is selected by changing only one of the coordinates in
the current configuration, and so on. More precisely, one configuration $C^j$ is
constructed from $C^{j-1}$ by changing the $i$-th coordinate in $C^{j-1}$. The new set $C_i^j$
in the $i$-th coordinate is selected verifying $I_{C^j} \neq \emptyset$ and with chance proportional
to $P_i(C_i^j)$.

For each configuration $C^j$, the corresponding trial will be successful if $I_{C^j} \subseteq X$.
$\mathrm{Bel}(X)$ is again estimated by the proportion of successes.
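
A minimal sketch of such a chain is given below. It assumes at least two mass functions stored as dictionaries from `frozenset` focal sets to masses, a starting configuration with nonempty intersection, and a cyclic sweep over the coordinates; the sweep order and all names are assumptions made for illustration. The estimate is only meaningful when the chain can reach every configuration of $\Omega$ from the starting point.

```python
import random


def intersect_except(config, i):
    """Intersection of all coordinates of config except the i-th."""
    others = [c for j, c in enumerate(config) if j != i]
    inter = others[0]
    for c in others[1:]:
        inter = inter & c
    return inter


def mcmc_bel(masses, x, start, n_steps, rng):
    """Estimate Bel(x) as the proportion of successful trials of the chain."""
    config = list(start)
    hits = 0
    for t in range(n_steps):
        i = t % len(masses)  # coordinate to change (cyclic sweep, an assumption)
        rest = intersect_except(config, i)
        # Candidate sets keep the intersection non-empty; the current set is one.
        cands = [a for a in masses[i] if a & rest]
        config[i] = rng.choices(cands, weights=[masses[i][a] for a in cands])[0]
        hits += (rest & config[i]) <= x
    return hits / n_steps


m1 = {frozenset("a"): 0.6, frozenset("ab"): 0.4}
m2 = {frozenset("b"): 0.5, frozenset("ab"): 0.5}
start = [frozenset("ab"), frozenset("ab")]
estimate = mcmc_bel([m1, m2], frozenset("a"), start, 40_000, random.Random(0))
print(estimate)  # the exact value for this toy example is 3/7
```

No trial is ever rejected, but successive trials are correlated, so more steps are needed than with independent samples for the same accuracy.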



3.3 Importance Sampling Algorithms 

Importance sampling is a popular technique in Statistics, used to reduce the 
variance in Monte Carlo simulation [8]. It has been successfully applied to ap- 
proximate probabilistic reasoning [3,4] and to belief functions calculation [6]. 
The idea of this technique is to try to obtain points in those regions of higher 
importance (probability) , using during the sampling process a distribution easier 
to handle than the original one (Pds), in the sense that it must be easy to obtain 
a sample from it. 

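Consider a probability distribution $P^*$ over $\Omega$ such that $P^*(C) > 0$ for all
$C \in \Omega$ with $P_{DS}(C) > 0$. Then, we can write

$$\mathrm{Bel}(X) = \sum_{\substack{C \in \Omega \\ I_C \subseteq X}} P_{DS}(C) = \sum_{C \in \Omega} P_{DS}(C)\,\delta_X(I_C) = \sum_{C \in \Omega} \frac{P_{DS}(C)\,\delta_X(I_C)}{P^*(C)}\,P^*(C) = E\left[\frac{P_{DS}(Z)\,\delta_X(I_Z)}{P^*(Z)}\right],$$

where $\delta_X(I_C) = 1$ if $I_C \subseteq X$ and 0 otherwise, and $Z$ is a random vector with
distribution $P^*$. Thus, $w = P_{DS}(Z)\,\delta_X(I_Z)/P^*(Z)$ is an unbiased estimator of
$\mathrm{Bel}(X)$ with variance

$$\mathrm{Var}(w) = E[w^2] - (E[w])^2 = \sum_{C \in \Omega} \frac{P_{DS}^2(C)\,\delta_X^2(I_C)}{(P^*(C))^2}\,P^*(C) - \mathrm{Bel}^2(X) = \sum_{C \in \Omega} \frac{P_{DS}^2(C)\,\delta_X^2(I_C)}{P^*(C)} - \mathrm{Bel}^2(X).$$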



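Now let $(C^1, \ldots, C^N)$ be a sample of elements of $\Omega$ obtained by sampling
from $P^*$, and $w_i$ the value of the estimator for $C^i$, also called the weight of $C^i$. Then
$\mathrm{Bel}(X)$ can be approximated by the sample mean as $\widehat{\mathrm{Bel}}(X) = (1/N) \sum_{i=1}^N w_i$.
Note that

$$w_i = \begin{cases} \dfrac{P_{DS}(C^i)}{P^*(C^i)} & \text{if } I_{C^i} \subseteq X, \\ 0 & \text{otherwise,} \end{cases}$$

which implies that $\widehat{\mathrm{Bel}}(X) = (1/N) \sum_{I_{C^i} \subseteq X} w_i$.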

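The variance of this sample mean estimator is

$$\mathrm{Var}(\widehat{\mathrm{Bel}}(X)) = \mathrm{Var}\left(\frac{1}{N} \sum_{i=1}^N w_i\right) = \frac{1}{N^2}\,(N\,\mathrm{Var}(w)) = \frac{1}{N}\left(\sum_{C \in \Omega} \frac{P_{DS}^2(C)\,\delta_X^2(I_C)}{P^*(C)} - \mathrm{Bel}^2(X)\right) \qquad (2)$$

$$= \frac{1}{N}\left(\sum_{\substack{C \in \Omega \\ I_C \subseteq X}} \frac{P_{DS}^2(C)}{P^*(C)} - \mathrm{Bel}^2(X)\right), \qquad (3)$$

where we have used that $\delta_X^2(I_C) = 1$ for $I_C \subseteq X$.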

The minimum variance is equal to 0, and is achieved when $P^* = P_{DS}(\cdot \mid \Omega_X)$,
where $\Omega_X = \{C \in \Omega \mid I_C \subseteq X\}$. For each $C \in \Omega$,
$P_{DS}(C \mid \Omega_X) = P_{DS}(\{C\} \cap \Omega_X)/P_{DS}(\Omega_X)$, and $P_{DS}(\Omega_X) = \mathrm{Bel}(X)$. This means that we need the value of
$\mathrm{Bel}(X)$ to construct a $P^*$ which minimizes the variance of the estimation, and this
value is unknown: it is precisely what we are trying to calculate. Furthermore,
$P^*$ depends on $X$ and different simulations are necessary for different sets of
interest $X$.

To obtain a small variance, we must select $P^*$ close to $P_{DS}$. $P^*$ must be easy
to handle, in the sense that it must be easy to simulate a sample from it. In this
case (that is, with $P^* = P_{DS}$) the variance we reach is $(1/N)\,\mathrm{Bel}(X)(1 - \mathrm{Bel}(X))$, which is
always less than or equal to $1/(4N)$.

Once $P^*$ is selected, the problem is how to compute the weights of the
configurations. For a given configuration $C^i$, its weight is $w_i = P_{DS}(C^i)/P^*(C^i)$.
$P^*(C^i)$ is easy to compute if we select $P^*$ appropriately. But it is not that easy to
calculate $P_{DS}(C^i) = P'(C^i)/P'(\Omega)$, because $P'(\Omega) = \sum_{C \in \Omega} P'(C)$. Moral and
Wilson [6] propose to use this weight instead: $W_i = P'(C^i)/P^*(C^i)$. In this case,

$$\frac{1}{N} \sum_{I_{C^i} \subseteq X} W_i = P'(\Omega)\,\frac{1}{N} \sum_{I_{C^i} \subseteq X} w_i$$

is an estimator of $\mathrm{Bel}(X) \cdot P'(\Omega)$. Since $(1/N) \sum_{i=1}^N W_i$ is an unbiased estimator
of $P'(\Omega)$, we can estimate $\mathrm{Bel}(X)$ with the value

$$\widehat{\mathrm{Bel}}(X) = \frac{\sum_{I_{C^i} \subseteq X} W_i}{\sum_{i=1}^N W_i}. \qquad (4)$$

Moral and Wilson [6] propose two algorithms based on the importance sam- 
pling technique. 



Importance Sampling with Restriction of Coherence (IS1). This scheme
is very similar to the simple Monte Carlo algorithm (see Sec. 3.1). The difference
is that, instead of generating a configuration $C$ and then checking the condition
$C \in \Omega$, whenever a new coordinate $C_i$ is going to be generated, it is chosen
with probability proportional to $P_i(C_i)$ from those sets verifying that
$C_i \cap (\bigcap_{j<i} C_j) \neq \emptyset$. In this way, only configurations in $\Omega$ are selected.

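Following this procedure, the probability of selecting a given configuration
$C = (C_1, \ldots, C_n)$ is

$$P^*(C) = \prod_{i=1}^n \frac{P_i(C_i)}{P_{CO}(i)} = \frac{P'(C)}{\prod_{i=1}^n P_{CO}(i)}, \qquad (5)$$

where $P_{CO}(i)$ is the probability of coherence: $P_{CO}(i) = \sum_{A \cap (\bigcap_{j<i} C_j) \neq \emptyset} P_i(A)$.
Then, the weight corresponding to $C$ is $W = P'(C)/P^*(C) = \prod_{i=1}^n P_{CO}(i)$.

Therefore, for a sample $(C^1, \ldots, C^N)$ generated from $P^*$, the estimation of
$\mathrm{Bel}(X)$ is

$$\widehat{\mathrm{Bel}}(X) = \frac{\sum_{I_{C^j} \subseteq X} \prod_{i=1}^n P^j_{CO}(i)}{\sum_{j=1}^N \prod_{i=1}^n P^j_{CO}(i)}, \qquad (6)$$

where $P^j_{CO}(i)$ denotes the probability of coherence of the $i$-th coordinate in the $j$-th sampled configuration.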
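
The IS1 scheme can be sketched as follows, assuming mass functions stored as dictionaries from `frozenset` focal sets to masses. The names are illustrative. One detail is an assumption of this sketch and is not discussed in the text: if, at some coordinate, no focal set is coherent with the sets already chosen, the partial configuration is given weight 0.

```python
import random


def is1_bel(masses, x, n_samples, rng):
    """Estimate Bel(x) with the ratio estimator, using coherence weights."""
    num = den = 0.0
    for _ in range(n_samples):
        inter, weight = None, 1.0
        for m in masses:
            # Focal sets coherent with the coordinates generated so far.
            cands = [a for a in m if inter is None or a & inter]
            p_co = sum(m[a] for a in cands)  # probability of coherence
            if p_co == 0.0:  # dead end (assumption: such a draw gets weight 0)
                weight = 0.0
                break
            c = rng.choices(cands, weights=[m[a] for a in cands])[0]
            inter = c if inter is None else inter & c
            weight *= p_co
        den += weight
        if weight > 0.0 and inter <= x:
            num += weight
    return num / den


m1 = {frozenset("a"): 0.6, frozenset("ab"): 0.4}
m2 = {frozenset("b"): 0.5, frozenset("ab"): 0.5}
estimate = is1_bel([m1, m2], frozenset("a"), 20_000, random.Random(0))
print(estimate)  # the exact value for this toy example is 3/7
```

In this toy example a configuration starting with {a} receives weight 0.5 and one starting with {a, b} receives weight 1, which is exactly the product of coherence probabilities.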



Selecting Configurations Containing an Element (IS2). This procedure
is based on selecting an element $\theta \in \Theta$ and then selecting a configuration such
that each one of its coordinates contains $\theta$. In this way, every configuration has
nonempty intersection: at least $\theta$ belongs to all the sets.

Each $\theta$ is selected with probability $P(\theta) = Q'(\theta)/\sum_{\theta' \in \Theta} Q'(\theta')$, where $Q'$ is
an approximation of the commonality function $Q$ of the combined belief.

Let $\theta^* = \{C \in \Omega \mid \theta \in I_C\}$. Then a configuration $C$ is selected from $\theta^*$
with probability proportional to $P_{DS}(C)$. This selection is easy since for $C \in \Omega$,
$P'(C \mid \theta^*) = P'(C \mid \theta^* \cap \Omega)$, which is equal to $P_{DS}(C \mid \theta^*) = P_{DS}(C)/Q(\theta)$. Then,
choosing a configuration with probability proportional to $P_{DS}$ is the same as
choosing it with probability proportional to $P'$.

Thus, we have that

$$P^*(C) = \sum_{\theta \in I_C} P(\theta)\,P_{DS}(C)/Q(\theta) \propto \left(\sum_{\theta \in I_C} \frac{Q'(\theta)}{Q(\theta)}\right) P_{DS}(C).$$

Therefore, the weight $W$ associated with $C$ is proportional to

$$1 \Big/ \sum_{\theta \in I_C} Q'(\theta)/Q(\theta),$$

which is equal to $1/|I_C|$ in the case that $Q' = Q$.

The performance of this algorithm in experimental evaluations is quite good
[6], especially if the focal elements have a small number of elements. However,
with big focal elements the algorithm may have problems: more difficulty in
the approximation of $Q'$ and more variation in the weights even if $Q' = Q$.
Imagine, for example, a belief function with two focal elements: one has only one
element, $\theta$, and the other contains the rest of the elements of $\Theta$. Both have a mass of 0.5. If
we select an element with a probability proportional to its commonality, then
since all elements have the same commonality, the distribution is uniform. The
focal element with only one element will be selected with probability $1/|\Theta|$
and weight 1, and the other focal element with probability $(|\Theta| - 1)/|\Theta|$ and weight
$1/(|\Theta| - 1)$. If the number of elements of $\Theta$ is high, this will produce a great
variance of the weights (most of the elements in the sample will have a small
relative weight), which will produce poor estimations.
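
The two-focal-element example can be checked numerically. The sketch below computes, with exact fractions, the squared coefficient of variation of the weights (variance divided by squared mean) for a sample space with $n = |\Theta|$ elements; the function name is hypothetical and the measure of dispersion is chosen here only for illustration.

```python
from fractions import Fraction


def weight_cv2(n):
    """Squared coefficient of variation of the IS2 weights for |Theta| = n."""
    p_small, w_small = Fraction(1, n), Fraction(1)         # singleton focal set
    p_big, w_big = Fraction(n - 1, n), Fraction(1, n - 1)  # the other focal set
    mean = p_small * w_small + p_big * w_big
    second = p_small * w_small**2 + p_big * w_big**2
    return (second - mean**2) / mean**2


for n in (2, 10, 100):
    print(n, weight_cv2(n))
```

Working the algebra through gives $n^2/(4(n-1)) - 1$, so the relative dispersion of the weights grows roughly linearly with the size of $\Theta$.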

4 Importance Sampling Based on Approximate Pre-computation

Here we propose a new importance sampling method. It is based on the algorithm 
for computing marginals in Bayesian networks developed by Hernandez, Moral 
and Salmeron [3,4]. 

The new algorithm is similar to the importance sampling with the restriction 
of coherence, but here, a previous step is carried out in which some information 
is collected in order to obtain a better sample. 

Note that each coordinate $C_i$ in a configuration $C$ is selected taking into
account just the mass $m_i$ and the intersection of the coordinates already simulated.
We propose to use also some information coming from the masses corresponding
to the coordinates that have not yet been simulated. In this way we try to avoid
rare configurations with low weight.

Let us describe the algorithm in more detail. We assume we are not able 
to compute Dempster’s rule. It means that the resources are limited and the 
number of focal sets in m is too big to be handled by the computer. Let s be the 
maximum number of focal sets that the system can handle. 

Given two masses $m_1$ and $m_2$, their approximate combination with limit $s$
will be a mass $m_1 \otimes_s m_2$, calculated in the usual way (adding the non-empty
intersections of focal elements of $m_1$ and $m_2$), but performing the following
reduction step each time the limit $s$ is surpassed:

1. Select two focal sets $B_1$ and $B_2$ and remove them from the list of focal sets
   of $m_1 \otimes_s m_2$. Let $B = B_1 \cup B_2$.

2. If $B$ is already a focal set of $m_1 \otimes_s m_2$, then

   $(m_1 \otimes_s m_2)(B) := (m_1 \otimes_s m_2)(B) + (m_1 \otimes_s m_2)(B_1) + (m_1 \otimes_s m_2)(B_2)$,

   else, add $B$ as a new focal set of the combination with

   $(m_1 \otimes_s m_2)(B) := (m_1 \otimes_s m_2)(B_1) + (m_1 \otimes_s m_2)(B_2)$.

The idea is to carry out an approximate and feasible computation by replacing
two focal sets $B_1$ and $B_2$ by their union each time the maximum number of
focal sets $s$ is surpassed. Using this procedure, $n$ new masses $m^*_1, \ldots, m^*_n$ can be
defined according to the following recursive formula: $m^*_1 = m_1$, $m^*_k = m_k \otimes_s m^*_{k-1}$,
where $k = 2, \ldots, n$. Note that these masses can always be computed since they
do not have more than $s$ focal sets.
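
The approximate combination can be sketched as below, assuming mass functions stored as dictionaries from `frozenset` focal sets to masses and $s \geq 1$. The rule for choosing which two focal sets to merge is left open at this point of the text, so the default `merge_lightest_pair` (merge the two focal sets with least mass) is a hypothetical placeholder, as are all function names.

```python
def merge_lightest_pair(raw):
    """Hypothetical selector: the two focal sets carrying the least mass."""
    return sorted(raw, key=raw.get)[:2]


def approx_combine(m1, m2, s, pick_pair=merge_lightest_pair):
    """Approximate combination of m1 and m2 keeping at most s focal sets."""
    raw = {}
    for a, va in m1.items():
        for b, vb in m2.items():
            c = a & b
            if not c:
                continue  # empty intersections carry the conflict
            raw[c] = raw.get(c, 0.0) + va * vb
            while len(raw) > s:  # reduction step: replace two sets by their union
                b1, b2 = pick_pair(raw)
                merged = raw.pop(b1) + raw.pop(b2)
                union = b1 | b2
                raw[union] = raw.get(union, 0.0) + merged
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("total conflict: the combination is undefined")
    return {c: v / total for c, v in raw.items()}


m1 = {frozenset("a"): 0.6, frozenset("ab"): 0.4}
m2 = {frozenset("b"): 0.5, frozenset("ab"): 0.5}
print(approx_combine(m1, m2, s=2))   # {b} and {a, b} are merged into {a, b}
print(approx_combine(m1, m2, s=10))  # limit never reached: exact combination
```

Because a merged set is a superset of both sets it replaces, any set that intersects a removed focal set still intersects the merged one.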

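Now we define the following sampling procedure. To obtain a configuration
$C = (C_1, \ldots, C_n)$ the sets are calculated from $n$ to 1 according to the following
distributions:

$$P^*_i(C_i \mid C_{i+1}, \ldots, C_n) = \begin{cases} k_i^{-1}\,P_i(C_i)\,\mathrm{Pl}^*_{i-1}(C_i) & \text{if } C_i \cap C_{i+1} \cap \cdots \cap C_n \neq \emptyset, \\ 0 & \text{otherwise,} \end{cases} \qquad (7)$$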
where $\mathrm{Pl}^*_{i-1}$ is the plausibility corresponding to mass $m^*_{i-1}$, $\mathrm{Pl}^*_0 \equiv 1$ and $k_i$ is
a normalization constant given by $k_i = \sum_{A \cap C_{i+1} \cap \cdots \cap C_n \neq \emptyset} P_i(A)\,\mathrm{Pl}^*_{i-1}(A)$, with
$k_1 = 1$. We have to check that $P^*$ is valid as a sampling distribution, that is, it
must be positive whenever $P_{DS}$ is. The only thing we have to check is that the
product of all $\mathrm{Pl}^*_i$ is positive when $P_{DS}$ is, but this is true since in the approximate
combination, removed sets are replaced by the union of both. Then, if a set has
positive mass, it also has positive plausibility after the deletion.

The intuition behind this sampling procedure is the following: if $\mathrm{Pl}^*_i$ were the
exact plausibility associated with $m_1 \otimes \cdots \otimes m_i$, for each $i$, then the sampling
distribution $P^*$ would be equal to the desired distribution $P_{DS}$ (this is simple
to see from the definition of this probability distribution). With approximate
combinations $m^*_i$ we are trying to simulate as close as possible to the distribution
$P_{DS}$.

One remarkable property of the approximate combination is that it can be
detected when the exact calculation is feasible: this is the case in which
the size threshold is not surpassed in any of the combinations performed to
compute $m^*_1, \ldots, m^*_n$. If this happens, in $m^*_n$ we obtain the exact combination
by Dempster’s rule, and then there is no need to simulate.

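The weight assigned to a configuration $C$ will be

$$W = \frac{P'(C)}{P^*(C)} = \frac{\prod_{i=1}^n P_i(C_i)}{P_1(C_1) \prod_{i=2}^n k_i^{-1}\,P_i(C_i)\,\mathrm{Pl}^*_{i-1}(C_i)} = \prod_{i=2}^n \frac{k_i}{\mathrm{Pl}^*_{i-1}(C_i)}. \qquad (8)$$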

4.1 The Main Algorithm 

The pseudo-code for this algorithm can be as follows. Let MASS_j(Y) be a
function returning the sum of $P_j(A)\,\mathrm{Pl}^*_{j-1}(A)$ subject
to $A \cap Y \neq \emptyset$, and SELECT_j(Y) a function returning one element $A \in 2^\Theta$ subject to
$A \cap Y \neq \emptyset$ with probability $P^*_j(A)$. Let PLAU_j(Y) be a function returning $\mathrm{Pl}^*_{j-1}(Y)$. The
following algorithm computes an estimation of $\mathrm{Bel}(X)$ with $X \subseteq \Theta$ from a
sample of size $N$:

FUNCTION ISAP
    m*_1 := m_1
    for i = 2 to n
        m*_i := m_i (x)_s m*_{i-1}
    next i
    S_1 := 0.0 ; S_2 := 0.0
    for i = 1 to N
        Y := Theta ; W := 1.0
        for j = n downto 2
            W := W * MASS_j(Y) / PLAU_j(Y)
            C_j := SELECT_j(Y)
            Y := Y ∩ C_j
        next j
        C_1 := SELECT_1(Y)
        Y := Y ∩ C_1
        Let C = (C_1, ..., C_n)
        if I_C ⊆ X then
            S_1 := S_1 + W
        else S_2 := S_2 + W
    next i
    return S_1 / (S_1 + S_2)
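
The following Python sketch illustrates the overall structure of the algorithm under several assumptions that are not taken from the text: mass functions are dictionaries from `frozenset` focal sets to masses; the reduction step merges the two focal sets with least mass (a placeholder for the selection rules discussed in Sec. 4.2); dead ends receive weight 0; and each weight factor is computed directly as the ratio $P_j(C_j)/P^*_j(C_j \mid \cdot)$ for every coordinate, including the first, so that the weight equals $P'(C)/P^*(C)$ regardless of how the normalization constants are written. All names are hypothetical.

```python
import random


def pl(m, x):
    """Plausibility of x under the mass function m."""
    return sum(v for a, v in m.items() if a & x)


def approx_combine(m1, m2, s):
    """Combination limited to s focal sets (merges the two lightest sets)."""
    raw = {}
    for a, va in m1.items():
        for b, vb in m2.items():
            c = a & b
            if not c:
                continue
            raw[c] = raw.get(c, 0.0) + va * vb
            while len(raw) > s:
                b1, b2 = sorted(raw, key=raw.get)[:2]
                merged = raw.pop(b1) + raw.pop(b2)
                raw[b1 | b2] = raw.get(b1 | b2, 0.0) + merged
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}


def isap_bel(masses, x, s, n_samples, rng):
    """Estimate Bel(x) by importance sampling with approximate pre-computation."""
    # Compilation phase: m*_1 = m_1, m*_k = m_k (x)_s m*_{k-1}.
    star = [masses[0]]
    for m in masses[1:]:
        star.append(approx_combine(m, star[-1], s))
    s1 = s2 = 0.0
    for _ in range(n_samples):
        inter, weight = None, 1.0
        for j in range(len(masses) - 1, -1, -1):  # coordinates n down to 1
            m = masses[j]
            cands = [a for a in m if inter is None or a & inter]
            # Proposal score P_j(A) * Pl*_{j-1}(A), with Pl*_0 = 1.
            score = [m[a] * (pl(star[j - 1], a) if j > 0 else 1.0) for a in cands]
            k_j = sum(score)
            if k_j == 0.0:  # dead end: no configuration in Omega is reachable
                weight = 0.0
                break
            idx = rng.choices(range(len(cands)), weights=score)[0]
            c = cands[idx]
            weight *= m[c] * k_j / score[idx]  # P_j(C_j) / P*_j(C_j | ...)
            inter = c if inter is None else inter & c
        if weight > 0.0 and inter <= x:
            s1 += weight
        else:
            s2 += weight
    return s1 / (s1 + s2)


m1 = {frozenset("a"): 0.6, frozenset("ab"): 0.4}
m2 = {frozenset("b"): 0.5, frozenset("ab"): 0.5}
estimate = isap_bel([m1, m2], frozenset("a"), 10, 20_000, random.Random(0))
print(estimate)  # the exact value for this toy example is 3/7
```

In this toy example the limit is never reached, the plausibilities are exact, and every sampled configuration receives the same weight $P'(\Omega) = 0.7$, which is the ideal situation described in the text.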



4.2 Particular Cases

The performance of the ISAP algorithm depends on the way in which the
approximate combination is done. More precisely, the critical decision to be made
is the selection of the sets to be removed in the reduction step of the approximate
combination. Those sets should be selected such that the distance between the mass
before and after the elimination is as small as possible. We have considered the
following measure of the cost of removing two sets $A$ and $B$ and adding their
union $A \cup B$ with $m(A \cup B) = m(A) + m(B)$:

$$d(A, B) = m(A)\,|B - A| + m(B)\,|A - B|. \qquad (9)$$

Thus, when performing the approximate combination, a first option can be
to compute $d(A, B)$ for every pair of focal sets $A$ and $B$, and select the pair producing the
lowest cost. We will refer to the ISAP algorithm with this option as ISAP1.
However, if the number of focal elements to be compared is high, it can be
too costly to compute all the values of the cost function. In order to avoid this
problem, we have tested two alternatives. In the first one, the focal sets of the
combination are kept in a list such that whenever a new focal set is inserted,
it is placed between those two sets for which the sum of the costs with
respect to the new set is the lowest. Then, when two sets must be removed, the
cost is computed just for adjacent sets in the list. We call this case of the
algorithm ISAP2. The second option is similar to the former, but now the sets are
inserted in the list in lexicographical order. This will be called ISAP3.
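
The cost function and the exhaustive selection used by ISAP1 can be sketched as follows, assuming the current focal sets are stored in a dictionary from `frozenset` to mass; the function names and the example values are illustrative.

```python
from itertools import combinations


def cost(raw, a, b):
    """Cost of replacing a and b by their union: m(A)|B - A| + m(B)|A - B|."""
    return raw[a] * len(b - a) + raw[b] * len(a - b)


def cheapest_pair(raw):
    """Exhaustive search over all pairs of focal sets (the ISAP1 option)."""
    return min(combinations(raw, 2), key=lambda pair: cost(raw, *pair))


raw = {frozenset("a"): 0.3, frozenset("b"): 0.2, frozenset("ab"): 0.2}
print(cheapest_pair(raw))  # the pair ({b}, {a, b}), with cost 0.2
```

The exhaustive search examines a number of pairs that is quadratic in the number of focal sets, which is the cost that the list-based variants try to avoid.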



5 Experimental Evaluation 

The performance of all the algorithms described in this paper has been
experimentally tested. The experiment consisted of combining 9 belief functions
defined over a sample space $\Theta$ with 50 elements. The number of focal sets in
each function was chosen following a Poisson distribution with mean 4.0. Each
focal set was determined by first generating a random number $p \in [0, 1]$ and
then including each $\theta \in \Theta$ with a probability determined by $p$ (so as to obtain
focal sets with a high number of elements).
The experiment has been repeated 10 times for different belief functions. In
each repetition, a set $X \subseteq \Theta$ was selected at random and its belief $\mathrm{Bel}(X)$ was
exactly computed and also approximated by the Monte Carlo algorithms. The
number of runs (i.e. the sample size) was set to 5000, and each algorithm was
executed 100 times.

If $\widehat{\mathrm{Bel}}(X)$ was the approximation given by one execution of a Monte Carlo
algorithm, the error in the estimation was calculated as

$$\frac{|\widehat{\mathrm{Bel}}(X) - \mathrm{Bel}(X)|}{\sqrt{\mathrm{Bel}(X)(1 - \mathrm{Bel}(X))}}, \qquad (10)$$

in order to make the errors independent of the value $\mathrm{Bel}(X)$. The mean of the
errors is shown in Table 1.
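
As a small illustration of this error measure, the sketch below assumes an exact belief strictly between 0 and 1; the function name and the numerical values are hypothetical and are not taken from the experiments.

```python
import math


def normalized_error(estimate, exact):
    """Absolute error divided by sqrt(Bel(X) * (1 - Bel(X)))."""
    return abs(estimate - exact) / math.sqrt(exact * (1.0 - exact))


print(normalized_error(0.45, 0.5))  # an absolute error of 0.05 becomes about 0.1
print(normalized_error(0.05, 0.1))  # the same absolute error weighs more near 0
```

The denominator is the standard deviation of a single indicator trial with success probability $\mathrm{Bel}(X)$, so the measure puts errors for different values of $\mathrm{Bel}(X)$ on a common scale.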

According to the results obtained, we can say that the new algorithms in
general provide better results than the previous ones. However, there are no
clear differences among the three versions of the ISAP algorithm. We think this
is due to the definition of function $d$ (see Eq. 9). The results of the new methods
do not seem to depend on the value of $P'(\Omega)$. In fact, we have checked that in
other examples with much lower values of $P'(\Omega)$, the error levels keep steady.



Exp.  P'(Ω)     MC        MCMC      IS1       IS2       ISAP1     ISAP2     ISAP3
 1    0.210590  0.028638  0.013578  0.015282  0.015407  0.009971  0.012571  0.011800
 2    0.218410  0.026885  0.011841  0.014306  0.016761  0.011758  0.010731  0.012193
 3    0.221991  0.022345  0.014493  0.013641  0.014716  0.011360  0.011528  0.010813
 4    0.269354  0.021218  0.012873  0.015125  0.015278  0.010635  0.010534  0.010434
 5    0.323525  0.016665  0.014102  0.014066  0.015277  0.011263  0.010742  0.011107
 6    0.383930  0.018580  0.012625  0.015151  0.014154  0.012355  0.011051  0.012406
 7    0.439027  0.016712  0.013032  0.015624  0.013595  0.012182  0.011447  0.010541
 8    0.455528  0.018179  0.013355  0.013546  0.015219  0.011607  0.011489  0.010506
 9    0.559075  0.015046  0.011789  0.012395  0.021883  0.011388  0.011904  0.012044
10    0.568261  0.014193  0.010899  0.011204  0.017189  0.010423  0.011737  0.011192



Table 1. Errors when combining 9 belief functions 



6 Conclusions 

We have presented a new technique for combining belief functions that shows
good performance. This conclusion is supported by the experiments we have
carried out, showing a behaviour clearly superior to existing algorithms. One
feature of this scheme is that it can be detected when the exact combination
can be done. Many aspects are still to be studied; for instance, a good measure
of the cost of removing focal sets could provide more accurate estimations. We
have only given a simple, but perhaps not very accurate, cost function for an
approximation step. This is supported by the fact that there are no important
differences between the algorithm (ISAP1) that considers the cost of joining all
the pairs of focal elements and the algorithms (ISAP2 and ISAP3) considering
only some of them. Besides, instead of representing mass functions as lists of
focal sets, a more sophisticated representation could be used. The semi-lattice
representation [9] would facilitate the realization of the operations among masses.

The algorithm we have developed deals with beliefs defined over the same
sample space. More work is needed to extend it to the case of multivariate
belief functions. Exact algorithms for this case can be found in [1,9].

Abstract. In this paper, we improve the abilities of a previous model for
representing vague information expressed in an affirmative or negative
form. The study focuses in particular on information referring to linguistic
negation. The extended definition of linguistic negation that we propose
makes it possible to deny combinations of nuanced properties based upon
disjunction and conjunction operators. The properties that this linguistic
negation possesses now allow us to consider it as a generalization of the logical
one, and this is in satisfactory agreement with linguistic analysis.



1 Introduction 

In this paper, we present a new model dealing with affirmative or negative
information within a fuzzy context. We refer to the methodology proposed in ([13],
[14], [15], [16]) to represent nuanced information, and more particularly, information
referring to linguistic negation. The model has been designed in such a way
that the user deals with statements expressed in natural language, that is to say,
referring to a graduation scale containing a finite number of linguistic expressions.
Since the underlying information is generally evaluated in a numerical way, this
approach is based upon the fuzzy set theory proposed by Zadeh in [19]. Previous
models allow us to represent a rule like “if the wage is not high, then the summer
holidays are not very long” and a fact like “the wage is really low”. But, in the previous
methodology, facts and rules are only based upon nuances of one property. So, they
do not accept facts or rules based upon combinations of properties as they may appear
in knowledge bases including a rule like “if the man is not visible in the crowd, he is
not medium or tall”, or a fact like “the man is rather small or very small”.

Our main idea has been to extend the definition of linguistic negation proposed in
([13], [14], [15], [16]) in order to take into account combinations of nuanced
properties based on conjunction or disjunction connectors. Representing nuances in
terms of membership functions in the context of fuzzy set theory ([19]) is a
difficult task, and numerous works tackle different aspects of this problem:
modifiers ([1], [2], [3], [5], [6], [12], [20], [7], [13], [14], [15]) or linguistic labels
([18]). Section 2 briefly presents a previous model [7] allowing the representation of

A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 316-327, 1999. 
nuanced properties. In Section 3, we present the main ideas leading to the previous
methodology. More precisely, we point out that it is necessary to abandon a
representation of linguistic negation in terms of a one-to-one correspondence and to
turn towards a one-to-many correspondence, called here a multiset function. We recall
previous results leading to the standard forms of negation. The definitions of
linguistic negation and of the intended meaning of a denied property are based upon
the ones proposed in ([16], [17]). Some modifications have been made to obtain better
accordance with the linguistic analysis proposed in ([11], [4], [9], [8], [10]). Section 4
is devoted to new developments which improve the abilities of the previous model.
We extend the definition of linguistic negation in order to be able to deny the
combinations of nuanced properties mentioned above. Then, we can define the set of intended
meanings of “x is not A”. Finally, we study the particular contraposition law. It
appears clearly that this more powerful concept of linguistic negation can be viewed
as a generalization of the logical one.



2 The Universe Description 

We suppose that our discourse universe is characterised by a finite number of
concepts $C_i$. A set of properties¹ $P_{ik}$ is associated with each $C_i$, whose description
domain is denoted as $D_i$. The properties $P_{ik}$ are said to be the basic properties
connected with concept $C_i$. For example, the concepts of “height”, “wage” and
“appearance” should be understood as qualifying individuals of the human type². The
concept “wage” can be characterised by the basic fuzzy properties “low”, “medium”
and “high”. Linguistic modifiers bearing on these basic properties permit us to express
nuanced knowledge. This work uses the model proposed in [7] to represent
affirmative information expressed in the form « x is $f_\alpha m_\beta P_{ik}$ » or « x is not $f_\alpha m_\beta P_{ik}$ »
in the case of negation³. In this context, expressing a property like “$f_\alpha m_\beta P_{ik}$”, called
here a nuanced property, requires a list of linguistic terms.



¹ This approach to linguistic negation is semi-qualitative. Indeed, the knowledge is expressed in
natural language, but the underlying information is evaluated in a numerical way. Note that a
qualitative approach to linguistic negation based upon a similarity relation is under study.

² The choice of referential types fixes standard values of basic properties for typical individuals
of the chosen referential types.

³ An important difference exists between the variables occurring in linguistic fuzzy assertions
and those defining membership degrees of the associated fuzzy sets. Consider, for instance,
the linguistic predicate “tall” associated with the concept “height”. Fuzzy set theory tells us
that “tall” must be represented by a fuzzy subset “tall” of the real line, whose membership
function, denoted as $\mu_{tall}$, associates with any $u \in \mathbb{R}$ a quantity $\alpha = \mu_{tall}(u)$: in other words, any
x, having u as “height” value (i.e. h(x) = u), belongs to the fuzzy set “tall” to a degree $\alpha$.
Logically speaking, x satisfies the predicate “tall” to a degree $\alpha$ (i.e. “x is tall” is $\alpha$-true) if
$\alpha = \mu_{tall}(u) = \mu_{tall}(h(x))$. So, x is a linguistic variable for individuals of the human type, and u a
variable for the corresponding real evaluations. In order to simplify the presentation of our
model, if no confusion is possible, for any property A associated with a concept C, and
represented by a fuzzy set A, if c(x) denotes the value of C for x, then the notation $\alpha = \mu_A(x)$
Two ordered sets of modifiers⁴ are selected depending on their modifying effects.

- The first one groups translation modifiers resulting somehow in both a translation
and a possible precision variation of the basic property. For example, the set of
translation modifiers could be $M_7$ = {extremely little, very little, rather little,
moderately (∅), rather, very, extremely}, totally ordered by the relation:
$m_\alpha < m_\beta \Leftrightarrow \alpha < \beta$ (Figure 1).

- The second one consists of precision modifiers which make it possible to increase
(or decrease) the precision of the previous properties. For example, $F_6$ = {vaguely,
neighbouring, more or less, moderately, really, exactly}, totally ordered by the
relation: $f_\alpha < f_\beta \Leftrightarrow \alpha < \beta$ (Figure 2).





Fig. 2. Precision Modifiers 



3 Improvement of the Basic Approach to Linguistic Negation 

It has been pointed out ([13], [14]) that fuzzy set theory [19] struggles to provide
an adequate representation of a negative piece of information like “Smith is not tall”.
Indeed, saying that “Smith is not tall”, one can implicitly refer to “medium” or
“extremely small” and not to “¬tall”. Within a qualitative context, one can refer to a
set of linguistic labels denoted as $L=\{u_0, ..., u_n\}$, totally ordered: $u_i < u_j \Leftrightarrow i < j$ [12]. The
linguistic negation, denoted as Neg, verifies: $Neg(u_i)=u_{n-i}$. But, when the speaker
asserts that “the wage of Smith is not high”, he can only say that the intended
meaning belongs to the set {“medium”, “low”}. So, the representation of linguistic
negation should be made by using a correspondence between an element of L and a
subset of L. More generally, any function associating with each nuanced property A a
subset of nuanced properties defined in the same domain as A will be called a
multiset function. Since the models presented in ([13], [14], [15]) alleviate the
difficulties of Torra’s approach based upon a multiset function [18], we refer to this
satisfactory methodology. The model proposed in [10] improves the previous ones
([13], [14]) by explicitly taking into account the different scopes of the linguistic
negation. This permits us to make clearer the different interpretations of “x is not A”



stands for $\alpha=\mu_A(c(x))$. So, by using the previous example, the notation $\alpha=\mu_{tall}(x)$ stands for
$\alpha=\mu_{tall}(h(x))$.

⁴ We don’t focus our attention on the semantic aspects leading to such modifiers. Their number
and the associated L-R functions have been chosen on the basis of linguistic intuition to
illustrate our purpose. A deeper semantic analysis of modifier meaning is currently under
study.
resulting from scoping effects in accordance with linguistic theories ([11], [4], [8], [9]).
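The contrast between the single-valued negation on an ordered label scale and a multiset-style negation can be sketched as follows. This is only an illustration: the three labels and the function names `neg_label` and `neg_multiset` are assumptions chosen for the example, not part of the model in [12] or [18].

```python
# Ordered label scale L = {u_0 < ... < u_n}; the labels are illustrative.
LABELS = ["low", "medium", "high"]


def neg_label(label):
    """Single-valued negation on an ordered scale: Neg(u_i) = u_{n-i}."""
    i = LABELS.index(label)
    return LABELS[len(LABELS) - 1 - i]


def neg_multiset(label):
    """Multiset-style negation: every label of the scale except the denied one."""
    return [u for u in LABELS if u != label]


print(neg_label("high"))     # the single label "low"
print(neg_multiset("high"))  # the subset {"low", "medium"}
```

With the single-valued operator, “not high” can only mean “low”, whereas the multiset version keeps “medium” as a candidate, which is the behaviour the wage example asks for.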

Within the discourse universe, let us denote as: $\mathcal{C}$ the set of distinct concepts $C_i$, $\mathcal{D}_i$
the domain associated with the concept $C_i$, $\mathcal{M}$ the set of modifier combinations, $\mathcal{B}_i$ the
set of associated basic properties $P_{ik}$ defined on $\mathcal{D}_i$, $\mathcal{N}_{ik}$ the set of all nuances of the
basic property $P_{ik}$, $\mathcal{N}_i$ the set of all nuanced properties associated with $C_i$. Then let:
$\mathcal{B}=\bigcup_i \mathcal{B}_i$, $\mathcal{N}=\bigcup_i \mathcal{N}_i$. The reference frame of the linguistic
negation is defined as follows.

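Definition 3.1. Let Neg be a multiset function verifying the conditions:

L1: $\forall n_\gamma \in \mathcal{M}$, $Neg(n_\gamma)=\mathcal{M} \setminus \{n_\gamma\}$,

L2: $\forall P_{ik} \in \mathcal{B}_i$, L21: $\exists m$, $Neg(P_{ik})=\{P_{im}\}$ or L22: $Neg(P_{ik})=\mathcal{B}_i \setminus \{P_{ik}\}$,

L3: $\forall x \in \mathcal{D}_i$, $Neg(x)=\mathcal{D}_i \setminus \{x\}$,

L4: $\forall n_\gamma \in \mathcal{M}$, $\forall P_{ik} \in \mathcal{B}_i$, $Neg(n_\gamma P_{ik})=\mathcal{N}_i \setminus \{n_\gamma P_{ik}\}$,

L5: $\forall n_\gamma \in \mathcal{M}$, $\forall P_{ik} \in \mathcal{B}_i$, $Neg(n_\gamma(P_{ik}))=\mathcal{N}_{ik} \setminus \{n_\gamma P_{ik}\}$⁵, and

L6: $\forall n_\gamma \in \mathcal{M}$, L61: $n_\gamma(Neg(P_{ik}))=\mathcal{N}_{im}$, or L62: $n_\gamma(Neg(P_{ik}))=\mathcal{N}_i \setminus \mathcal{N}_{ik}$.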

From a linguistic point of view “x is not $n_\gamma P_{ik}$” can generally express something
corresponding to an affirmative assertion “y is $n_\delta P_{ij}$”. Following are the different
standard forms of the nuanced property $n_\delta P_{ij}$ resulting from each possible scope.
Saying “x is not $n_\gamma P_{ik}$”, for this x, the speaker may express or refer to:

- A rejection without reference to an affirmative translation of the negation. [F0]
“Smith is not guilty”, without reference to an affirmative property, may occur in a context
where the only thing known about his culpability is that his alibi is confirmed.

- Another object instead of x belonging to the same domain and satisfying the
nuanced property $n_\gamma P_{ik}$. [F1]

So, “x is not $n_\gamma P_{ik}$” means “not(x) is $n_\gamma P_{ik}$” or, in other words, $not(x)=y \in Neg(x)$
(cond. L3). As an example, “Jack is not guilty” since it is John who is guilty.

- Another nuance of the same property. [F2]

So, “x is not $n_\gamma P_{ik}$” ⇔ “x is not($n_\gamma(P_{ik})$)” ⇔ “x is $n_\delta P_{ik}$”. So, $not(n_\gamma(P_{ik}))=n_\delta P_{ik} \in Neg(n_\gamma(P_{ik}))$
(cond. L5). As an example, “Jack is not small” since “Jack is extremely small”.

- A property except $n_\gamma P_{ik}$ which is nonetheless associated with the same concept. [F3]
So, “x is not $n_\gamma P_{ik}$” means “x is not($n_\gamma P_{ik}$)” which is equivalent to “x is $n_\delta P_{ij}$”. So,
$not(n_\gamma P_{ik})=n_\delta P_{ij} \in Neg(n_\gamma P_{ik})$ (condition L4). For example, “the wage being not very
high” can be “really low”, “medium” or “rather little high”.

- A nuance of another basic property associated with the same concept. [F4]

So, “x is not $n_\gamma P_{ik}$” means “x is $n_\gamma(not(P_{ik}))$” or “x is $n_\delta P_{ij}$”. Here,
$n_\gamma(not(P_{ik}))=n_\delta P_{ij} \in n_\gamma(Neg(P_{ik}))$ (cond. L62). For example, “John is not small” since he
is at least “medium”.

- A nuance of another precise basic property associated with the same concept. [F5]
So, “x is not $n_\gamma P_{ik}$” means “x is $n_\gamma(not(P_{ik}))$” or “x is $n_\delta P_{im}$”. Here, $n_\delta P_{im} \in
n_\gamma(Neg(P_{ik}))=\mathcal{N}_{im}$ (cond. L61). So, “Jack is not very big”, since “he is rather thin”.



⁵ The conditions L4 and L5 define two different sets: the first one contains all nuances of all
properties $P_{ik}$ associated with the concept $C_i$ (except the precise nuance $n_\gamma P_{ik}$), and the second
one only contains all nuances of the given property $P_{ik}$ (except the precise nuance $n_\gamma P_{ik}$).
- A new basic property of the same concept. [F6]

In this case “x is not A” means “x is not-A”: a new basic property denoted as “not-A”
is associated with the same concept. As an example, “this wine is not bad” can induce
the new basic property “not-bad”.

Remark: In the following, we no longer refer to the first standard form $F_1$, which
seldom occurs in knowledge bases, and suppose that the new property introduced in
the last form $F_6$ appears among other basic properties. Moreover, for any $A \in \mathcal{N}$ and $F_t$
(t=2, ..., 5) the set Neg(A) is exactly defined.

Definition 3.2 If standard form $F_t$ (t=2, ..., 5) is applied to nuance $A \in \mathcal{N}$, the set
Neg(A) will be denoted $Neg_t(A)$. It defines the basic reference frame from which the
intended meanings of linguistic negation of A will be extracted.

Remark: Since the standard form $F_0$ implies a rejection of A without reference to an
affirmative assertion, we can put: $Neg_0(A)=\emptyset$.

In this work, we refer to the definition proposed in [16]. A comparison
of the different definitions proposed in ([13], [14], [15]) has been made in [17]: it is
proved that the definition proposed in [16] can induce the previous ones. In accordance
with linguistic analysis of linguistic negation ([11], [4], [9], [8], [10]), it has been
pointed out in ([13], [14]) that when one asserts that « x is not A » then, (1) one
rejects a reference to « x is A », and (2) if necessary, one refers either to the logical
negation of A, or to another property P different from A but defined in the same
domain, or sometimes to a nuance $f_\alpha m_\beta A$ of A, or finally to a new basic property
denoted as not-A. In the following, the judgement of rejection receives as an
interpretation: $\mu_A(x) \le \varepsilon$, which differs from the one proposed in ([14], [16]). Indeed,
linguistically speaking, one rejects that “x is A” receives a significant truth degree for
x. In other words, its value is close to false, or the corresponding membership degree
is close to 0. So, the requirement is no longer that this value be lower than a value
close to 1, as was the case in the initial definition. Moreover, asserting that “x is not A”,
if necessary the user refers to “x is P” as the intended meaning of his negation. The
previous analysis only defines the standard forms of the linguistic negation. But, it is
obvious that not every element of $Neg_t(A)$ can lead to the intended meaning of “x is not A”.
Intuitively, the speaker understands a real difference between the membership degrees
belonging to A and P for their significant values: $\mu_A(x)$ (resp. $\mu_P(x)$) $\ge \rho \Rightarrow \mu_P(x)$ (resp. $\mu_A(x)$) $\le \varepsilon$.

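Definition 3.3 Let $\rho$, $\varepsilon$ be such that: $0 \le \varepsilon \le \rho \le 1$. Given a standard form $F_t$, the multiset
function $Neg^t_{\rho,\varepsilon}: \mathcal{N} \rightarrow \mathcal{P}(\mathcal{N})$ defines for any $A \in \mathcal{N}$ a set $Neg^t_{\rho,\varepsilon}(A) \subset Neg_t(A)$. More
precisely, $P \in Neg^t_{\rho,\varepsilon}(A)$ if and only if P satisfies the following conditions:

CD0: $P \in Neg_t(A)$, CD1: $\forall x$, $\{\mu_A(x) \ge \rho \Rightarrow \mu_P(x) \le \varepsilon\}$, and CD2: $\forall x$, $\{\mu_P(x) \ge \rho \Rightarrow \mu_A(x) \le \varepsilon\}$.
Then, $Neg^t_{\rho,\varepsilon}(A)$ is said to be the linguistic negation $\rho$-compatible with A with the
tolerance threshold $\varepsilon$, given the standard form $F_t$. Moreover, any $P \in Neg^t_{\rho,\varepsilon}(A)$ is said
to be a linguistic negation $\rho$-compatible with A with the tolerance threshold $\varepsilon$, given
the standard form $F_t$.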

Example: We have collected in Figure 3 $Neg^3_{0.75, 0.35}$(“low”), a set of negations 0.75-compatible
with “low” with tolerance threshold 0.35, given the standard form $F_3$.

Fig. 3. Plausible negations 0.75-compatible with “low” with tolerance threshold 0.35 given $F_3$.
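Conditions CD1 and CD2 of Definition 3.3 can be checked numerically on a sampled domain. The sketch below is a minimal illustration under stated assumptions: the trapezoidal membership functions, the 0..100 wage scale, and the names `trapezoid` and `compatible` are all hypothetical (the L-R functions of the paper are not reproduced here), and non-strict inequalities are assumed in CD1 and CD2.

```python
def trapezoid(a, b, c, d):
    """Membership function: 1 on [b, c], 0 outside (a, d), linear in between."""
    def mu(u):
        if b <= u <= c:
            return 1.0
        if u <= a or u >= d:
            return 0.0
        if u < b:
            return (u - a) / (b - a)
        return (d - u) / (d - c)
    return mu


def compatible(mu_a, mu_p, rho, eps, domain):
    """Check CD1 and CD2 of Definition 3.3 on a finite sample of the domain."""
    cd1 = all(mu_p(u) <= eps for u in domain if mu_a(u) >= rho)
    cd2 = all(mu_a(u) <= eps for u in domain if mu_p(u) >= rho)
    return cd1 and cd2


domain = range(0, 101)                  # sampled wage scale (hypothetical units)
low = trapezoid(0, 0, 20, 40)           # assumed shape for "low"
rather_low = trapezoid(10, 20, 30, 50)  # assumed shape for a nuance of "low"
high = trapezoid(60, 80, 100, 100)      # assumed shape for "high"

print(compatible(low, high, 0.75, 0.35, domain))        # True
print(compatible(low, rather_low, 0.75, 0.35, domain))  # False
```

Under these assumed shapes, “high” passes both conditions with respect to “low”, while “rather low” overlaps the core of “low” too much and fails CD1.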

The previous definition constructs the effective reference frame $Neg^t_{\rho,\varepsilon}(A)$ from
which we have to extract the intended meanings of the linguistic negation⁶. Then, we
have to define a subset of $Neg^t_{\rho,\varepsilon}(A)$, denoted as $neg^t_{\rho,\varepsilon}(x, A)$, which consists of the
intended meanings of “x is not A”. Since $\varepsilon$ is the tolerance threshold from which the
membership degrees are significant, we accept as intended meanings only the
solutions satisfying this condition for x.

Definition 3.4 Put $neg^t_{\rho,\varepsilon}(x, A)=\{P \in Neg^t_{\rho,\varepsilon}(A) \mid \mu_P(x) \ge \varepsilon\}$. Any $P \in neg^t_{\rho,\varepsilon}(x, A)$ is
called an intended meaning for x of the linguistic negation $\rho$-compatible with A with
tolerance threshold $\varepsilon$, given the standard form $F_t$. We say also that “x is P” is an
intended meaning with the tolerance threshold $\varepsilon$ of the linguistic negation
$\rho$-compatible of “x is A”. If no confusion is possible, we simply say that P is an
intended meaning of the linguistic negation of A for x⁷.

Example: By using the previous solutions (cf. Figure 3), we have collected in Figure 4
the intended meanings of “x is not low” for the values a (2 solutions based upon
“low”) and b (7 solutions based upon “medium” and “high”)⁸.

Remark: This definition also differs from the one proposed in ([13], [14], [15], [16]).
Linguistically speaking, “x is P”, the intended meaning of “x is not A”, should not be
false (or close to false) for this x. So, it seems more natural to reject a truth degree
close to false and not a truth degree lower than a value close to true, as is the case in
the previous definitions. So, in terms of membership degrees, we have to take a
linguistic negation P at x for which $\mu_P(x) \ge \varepsilon$, instead of P for which $\mu_P(x) \ge \rho$.



⁶ Given a standard form $F_t$, the reference frame of the linguistic negation has been conceived in
such a way that it contains all plausible nuanced intended meanings. Their number can be
high when they refer to nuances of all basic properties associated with $C_i$. But this cannot
create difficulty since they are obtained by an automatic process. Moreover, the strength of
the linguistic negation, based upon the parameters $\rho$ and $\varepsilon$ (see Proposition 3.3), allows us to
control the number of elements of the reference frame. This point will be presented in a further
paper.

⁷ Please note that the linguistic negation for a given element does not require the exact value of
the membership degree for the concept but only a lower bound $\varepsilon$, taken as a parameter.

⁸ In this example, the number of intended meanings of the linguistic negation for the value a
seems high, but we can point out the fact that we refer in this case to the most general
standard form $F_3$: another nuance, except the denied one, which is nevertheless associated
with the same concept. That is why the following default strategy of choice selects one
intended meaning among the previous plausible linguistic negations for x.
Fig. 4. Intended meanings of negations 0.75-compatible for a or b with “low” with tolerance
threshold 0.35 given $F_3$.

If the user wishes only one interpretation of the negation, it is possible to use the 
default choice strategy proposed in [17]: a choice can be made among the solutions 
leading to the most significant membership degree and having the weakest 
complexity. 

Definition 3.5 The complexity of the nuanced property A, denoted as comp(A), is
equal to the number of nuances (different from ∅) required in its definition.

Put: $\xi^t(x, A)=\max\{\mu_P(x) \mid P \in neg^t_{\rho,\varepsilon}(x, A)\}$. A particular choice can be made among
the plausible solutions leading to $\xi^t(x, A)$, the most significant membership degree,
and having the weakest complexity.

Definition 3.6 A choice of a nuanced property P satisfying the following conditions:
I1: $P \in neg^t_{\rho,\varepsilon}(x, A)$, I2: $\mu_P(x)=\xi^t(x, A)$, I3: $\forall Q \in neg^t_{\rho,\varepsilon}(x, A)$, $\{\mu_Q(x)=\xi^t(x, A) \Rightarrow
comp(P) \le comp(Q)\}$, defines “x is P” as the intended meaning of “x is not A”.

Example: By using the solutions collected in Figure 4, “a is not low” receives as 
intended meaning “a is rather little high”. 
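Definitions 3.4 to 3.6 amount to a filter followed by a two-level selection, which can be sketched as below. Everything concrete here is an assumption made for illustration: the candidate frame, its trapezoidal membership functions, the complexity values, and the names `intended_meanings` and `default_choice` are hypothetical and do not reproduce Figures 3 and 4.

```python
def trapezoid(a, b, c, d):
    """Membership function: 1 on [b, c], 0 outside (a, d), linear in between."""
    def mu(u):
        if b <= u <= c:
            return 1.0
        if u <= a or u >= d:
            return 0.0
        if u < b:
            return (u - a) / (b - a)
        return (d - u) / (d - c)
    return mu


def intended_meanings(x, frame, eps):
    """Definition 3.4: keep the members P of the frame with mu_P(x) >= eps."""
    return {name: (mu, comp) for name, (mu, comp) in frame.items() if mu(x) >= eps}


def default_choice(x, frame, eps):
    """Definition 3.6: highest membership at x, ties broken by lowest complexity."""
    meanings = intended_meanings(x, frame, eps)
    if not meanings:
        return None  # the set of intended meanings can be empty
    best = max(mu(x) for mu, _ in meanings.values())
    top = [(comp, name) for name, (mu, comp) in meanings.items() if mu(x) == best]
    return min(top)[1]


# Hypothetical reference frame for "not low": name -> (membership, complexity).
frame = {
    "medium": (trapezoid(30, 45, 55, 70), 0),
    "really medium": (trapezoid(35, 47, 53, 65), 1),
    "high": (trapezoid(60, 80, 100, 100), 0),
    "very high": (trapezoid(70, 90, 100, 100), 1),
}

print(sorted(intended_meanings(50, frame, 0.35)))  # ['medium', 'really medium']
print(default_choice(50, frame, 0.35))             # medium
print(default_choice(10, frame, 0.35))             # None
```

At x = 50 both “medium” and “really medium” reach degree 1, so the complexity criterion decides in favour of the unmodified property; at x = 10 no candidate is significant and no intended meaning is returned.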

We can recall basic properties of linguistic negation ([13], [14], [15], [16]). 
Proposition 3.1 The knowledge about “x is A” doesn’t automatically define the 
knowledge about “x is not A”. 

Knowing A exactly does not imply, as it does for the logical negation, precise
knowledge of its negation, since most negations require complementary information such as
the choice among the intended meanings.

Proposition 3.2 If $P \in Neg^t_{\rho,\varepsilon}(A)$ then $A \in Neg^t_{\rho,\varepsilon}(P)$. But, if $P \in neg^t_{\rho,\varepsilon}(x, A)$, we
cannot assert that $A \in neg^t_{\rho,\varepsilon}(x, P)$. So, the double negation of A belongs to a reference
frame of the linguistic negation, but the intended meaning of “x is not P” does not
generally lead to “x is A”.

For example, the concept “height” can be characterised by the basic properties 
“small”, “medium” and “tall”. Then, the user can choose “x is small” as an intended 
meaning of “x is not tall”, and “x is really very tall” as the negation of “x is small”. 

Proposition 3.3 $\{\rho \le \rho', \varepsilon' \le \varepsilon \le \rho\} \Rightarrow Neg^t_{\rho',\varepsilon'}(A) \subset Neg^t_{\rho,\varepsilon}(A)$⁹



⁹ Note that this property implies that the reference frame associated with the boolean negation
is included in each reference frame of the linguistic negation family defined here (more
precisely it corresponds to the values $\rho=1$ and $\varepsilon=0$). Moreover, the combinations of different
values for $\rho$ and $\varepsilon$ satisfying the previous conditions lead to a set of included reference frames
Proposition 3.4 There exist $\rho$ and $\varepsilon$ such that the negation $\rho$-compatible with the
tolerance threshold $\varepsilon$ takes into account all previous interpretations of “x is not A”.

Proposition 3.5 $neg^t_{\rho,\varepsilon}(x, A)$ can be an empty set. Even if the set of all nuanced
properties associated with $C_i$ can be totally ordered, for any $A \in \mathcal{N}_i$ the set $neg^t_{\rho,\varepsilon}(x, A)$
can be a non-convex set.

Proposition 3.6 The model can deal with boolean basic properties without nuances
by choosing $\rho=1$ and $\varepsilon=0$. In this case, the standard forms correspond with the
linguistic notion of marked (or not) property ([11], [8]).



4 An Extended Modelisation of Linguistic Negation 

The previous approach to linguistic negation does not allow the use of assertions
referring to a combination of nuances. It can be pointed out that this possibility of
combinations implicitly appears in standard forms $F_2$ to $F_5$. When the user asserts that
“x is not P”, he can also refer to a combination of nuances based upon a conjunction
(and, denoted here ∧) or a disjunction (or, denoted here ∨). For example, “John is not
tall” since “John is small or extremely tall”, “Jack is not tall or small” since “Jack is
really medium” or “My hat is not black and white” since “it is red and blue”. So, we
have to modify the previous definition of linguistic negation in order to take into
account this natural extension of the linguistic negation.

In the following, we introduce the more general notion of extended linguistic
negation in order to improve the abilities of the previous model. For any linguistic nuance
A, we suppose that A is its associated fuzzy set (defined by $\mu_A$). We suppose that the
linguistic connectors ∨ and ∧ are defined with the fuzzy set operators ∪ and ∩:
$\mu_{A \cup B}(x)=\max(\mu_A(x), \mu_B(x))$ and $\mu_{A \cap B}(x)=\min(\mu_A(x), \mu_B(x))$. In order to extend the
concept of linguistic negation, we introduce the notion of complex nuance.

Definition 4.1 Given a concept $C_i$, a combination of nuances $A_h \in \mathcal{N}_i$, h=1, ..., p, based
upon the operators ∨ and ∧, denoted as $U(A_1, ..., A_h, ..., A_p)$, is said to be a complex
nuance induced from $\mathcal{N}_i$. If no confusion is possible, then U stands for $U(A_1, ..., A_h,
..., A_p)$.

Example: The properties “low”, “medium” and “high” being associated with
$C_i$=“wage”, then U=U(extremely low, medium, rather little high)=extremely
low ∨ (medium ∧ rather little high) is a complex nuance induced from $\mathcal{N}_i$.
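A complex nuance such as the one in this example can be evaluated pointwise with the max and min operators introduced above. The following is a small sketch only: the triangular shapes chosen for the three nuances and the helper names `tri`, `disj` and `conj` are assumptions made for illustration.

```python
def tri(a, b, c):
    """Triangular membership function with support (a, c) and peak at b (a < b < c)."""
    def mu(u):
        if u <= a or u >= c:
            return 0.0
        if u <= b:
            return (u - a) / (b - a)
        return (c - u) / (c - b)
    return mu


def disj(*mus):
    """Linguistic 'or' (∨): pointwise maximum of the membership degrees."""
    return lambda u: max(mu(u) for mu in mus)


def conj(*mus):
    """Linguistic 'and' (∧): pointwise minimum of the membership degrees."""
    return lambda u: min(mu(u) for mu in mus)


# Assumed shapes on a 0..100 wage scale.
extremely_low = tri(0, 5, 10)
medium = tri(30, 50, 70)
rather_little_high = tri(50, 65, 80)

# U = extremely low ∨ (medium ∧ rather little high)
U = disj(extremely_low, conj(medium, rather_little_high))

print(U(5), U(60), U(20))  # 1.0 0.5 0.0
```

The value at 60 comes from the conjunction: “medium” gives 0.5 and “rather little high” about 0.67, so the minimum 0.5 is retained.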

Definition 4.2 Let us suppose that $C_i$ is a concept, $F_t$ a standard form, and $U(A_1, ...,
A_h, ..., A_p)$ (or U) is a complex nuance induced from $\mathcal{N}_i$. The extended linguistic
negation $\rho$-compatible with U with the tolerance threshold $\varepsilon$, given the standard form
$F_t$, denoted as $C\_Neg^t_{\rho,\varepsilon}(U)$, is a set of complex nuances induced from $\mathcal{N}_i$. More
precisely, $V(B_1, ..., B_m, ..., B_q)$ (or V) being a complex nuance induced from $\mathcal{N}_i$, then



ordered by inclusion. From a linguistic point of view, this order encompasses the extent of
the semantic fields covered by the different interpretations of the linguistic negation operator.
This last point is currently under study.
$V \in C\_Neg^t_{\rho,\varepsilon}(U)$ if and only if: CD0’: $\forall B_m$, $\exists A_h$ such that $B_m \in Neg^t_{\rho,\varepsilon}(A_h)$, CD1’:
$\forall x$, $\{\mu_U(x) \ge \rho \Rightarrow \mu_V(x) \le \varepsilon\}$, and CD2’: $\forall x$, $\{\mu_V(x) \ge \rho \Rightarrow \mu_U(x) \le \varepsilon\}$. Then, any
$V \in C\_Neg^t_{\rho,\varepsilon}(U)$ is said to be an extended linguistic negation $\rho$-compatible with U
with the tolerance threshold $\varepsilon$, given the standard form $F_t$.

Remark: It is obvious that $Neg^t_{\rho,\varepsilon}(A) \subset C\_Neg^t_{\rho,\varepsilon}(A)$, for any $A \in \mathcal{N}_i$.

The following proposition is the basis of the general properties of this extended
linguistic negation.

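Proposition 4.1 A complex nuance $V \in C\_Neg^t_{\rho,\varepsilon}(U)$ if:

(1): $U \in \mathcal{N}_i$ and:

(1a): $V=\bigvee_{l \in L} Q^l$ with $\forall l \in L$, $Q^l \in Neg^t_{\rho,\varepsilon}(U)$, or

(1b): $V=\bigwedge_{l \in L} R^l$ with $\forall l \in L$, $R^l \in Neg^t_{\rho,\varepsilon}(U)$, or

(1c): $V=Q^0 \vee (\bigwedge_{l \in L} R^l)$ with $Q^0 \in Neg^t_{\rho,\varepsilon}(U)$ and $\forall l \in L$, $R^l \in Neg^t_{\rho,\varepsilon}(U)$, or

(1d): $V=R^0 \wedge (\bigvee_{l \in L} Q^l)$ with $R^0 \in Neg^t_{\rho,\varepsilon}(U)$ and $\forall l \in L$, $Q^l \in Neg^t_{\rho,\varepsilon}(U)$.

(1e): $V=U$ with $V \in \mathcal{N}_i$ and $U \in Neg^t_{\rho,\varepsilon}(V)$.

(2): $U=A \wedge B$ with $A \in \mathcal{N}_i$ and $B \in \mathcal{N}_i$, $V=P \vee Q$ with $P \in Neg^t_{\rho,\varepsilon}(A)$ and $Q \in Neg^t_{\rho,\varepsilon}(B)$,

(3): $U=A \vee B$ with $A \in \mathcal{N}_i$ and $B \in \mathcal{N}_i$, $V=P \wedge Q$ with $P \in Neg^t_{\rho,\varepsilon}(A)$ and $Q \in Neg^t_{\rho,\varepsilon}(B)$.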

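Proof. Condition CD0’ is verified immediately. Let us now examine conditions CD1’
and CD2’. The following elements of proof are based on definition 3.4.

(1a): Put $V=\bigvee_{l \in L} Q^l$ with $\forall l \in L$, $Q^l \in Neg^t_{\rho,\varepsilon}(U)$. Then, $\mu_U(x) \ge \rho \Rightarrow \{\forall l \in L,
\mu_{Q^l}(x) \le \varepsilon\} \Rightarrow \mu_V(x) \le \varepsilon$. Conversely we have: $\mu_V(x) \ge \rho \Rightarrow \{\exists l \in L, \mu_{Q^l}(x) \ge \rho\} \Rightarrow \mu_U(x) \le \varepsilon$.

(1b): Put $V=\bigwedge_{l \in L} R^l$ with $\forall l \in L$, $R^l \in Neg^t_{\rho,\varepsilon}(U)$. Then, $\mu_U(x) \ge \rho \Rightarrow \{\forall l \in L, \mu_{R^l}(x) \le \varepsilon\}
\Rightarrow \mu_V(x) \le \varepsilon$. Conversely, we have: $\mu_V(x) \ge \rho \Rightarrow \{\forall l \in L, \mu_{R^l}(x) \ge \rho\} \Rightarrow \mu_U(x) \le \varepsilon$.

(1c): Put: $V=Q^0 \vee (\bigwedge_{l \in L} R^l)$ with $Q^0 \in Neg^t_{\rho,\varepsilon}(U)$ and $\forall l \in L$, $R^l \in Neg^t_{\rho,\varepsilon}(U)$. Then, we
have: $\mu_U(x) \ge \rho \Rightarrow \{\mu_{Q^0}(x) \le \varepsilon$ and $\forall l \in L, \mu_{R^l}(x) \le \varepsilon\} \Rightarrow \mu_V(x) \le \varepsilon$. Conversely, we have:
$\mu_V(x) \ge \rho \Rightarrow \{\mu_{Q^0}(x) \ge \rho$ or $\forall l \in L, \mu_{R^l}(x) \ge \rho\} \Rightarrow \mu_U(x) \le \varepsilon$.

(1d): Put: $V=R^0 \wedge (\bigvee_{l \in L} Q^l)$ with $R^0 \in Neg^t_{\rho,\varepsilon}(U)$ and $\forall l \in L$, $Q^l \in Neg^t_{\rho,\varepsilon}(U)$. Then, we
have: $\mu_U(x) \ge \rho \Rightarrow \{\mu_{R^0}(x) \le \varepsilon$ and $\forall l \in L, \mu_{Q^l}(x) \le \varepsilon\} \Rightarrow \mu_V(x) \le \varepsilon$. Conversely, we obtain:
$\mu_V(x) \ge \rho \Rightarrow \{\mu_{R^0}(x) \ge \rho$ and $\exists l \in L, \mu_{Q^l}(x) \ge \rho\} \Rightarrow \mu_U(x) \le \varepsilon$.

(1e): If $P \in Neg^t_{\rho,\varepsilon}(U)$, then $U \in Neg^t_{\rho,\varepsilon}(P)$. So, the double linguistic negation of A can
be a linguistic negation of A.

A and B being associated with the same concept, let us suppose that $P \in Neg^t_{\rho,\varepsilon}(A)$
and $Q \in Neg^t_{\rho,\varepsilon}(B)$. Then:

(2): $\mu_{A \wedge B}(x) \ge \rho \Rightarrow \{\mu_A(x) \ge \rho$ and $\mu_B(x) \ge \rho\} \Rightarrow \{\mu_P(x) \le \varepsilon$ and $\mu_Q(x) \le \varepsilon\} \Rightarrow \mu_{P \vee Q}(x) \le \varepsilon$.

Conversely, we have: $\mu_{P \vee Q}(x) \ge \rho \Rightarrow \{\mu_P(x) \ge \rho$ or $\mu_Q(x) \ge \rho\} \Rightarrow \{\mu_A(x) \le \varepsilon$ or $\mu_B(x) \le \varepsilon\} \Rightarrow
\mu_{A \wedge B}(x) \le \varepsilon$.

(3): A similar analysis gives us: $\mu_{A \vee B}(x) \ge \rho \Rightarrow \mu_{P \wedge Q}(x) \le \varepsilon$; $\mu_{P \wedge Q}(x) \ge \rho \Rightarrow \mu_{A \vee B}(x) \le \varepsilon$.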
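Cases (2) and (3) of Proposition 4.1 can be sanity-checked numerically on a sampled domain. This is a hedged sketch rather than a proof: the trapezoidal membership functions, the 0..100 scale, and the helper names are assumptions, and the check only covers conditions CD1’ and CD2’ with non-strict inequalities.

```python
def trapezoid(a, b, c, d):
    """Membership function: 1 on [b, c], 0 outside (a, d), linear in between."""
    def mu(u):
        if b <= u <= c:
            return 1.0
        if u <= a or u >= d:
            return 0.0
        if u < b:
            return (u - a) / (b - a)
        return (d - u) / (d - c)
    return mu


def conj(f, g):
    """Linguistic 'and' (∧): pointwise minimum."""
    return lambda u: min(f(u), g(u))


def disj(f, g):
    """Linguistic 'or' (∨): pointwise maximum."""
    return lambda u: max(f(u), g(u))


def compatible(mu_u, mu_v, rho, eps, domain):
    """CD1' and CD2' checked on a finite sample of the domain."""
    cd1 = all(mu_v(x) <= eps for x in domain if mu_u(x) >= rho)
    cd2 = all(mu_u(x) <= eps for x in domain if mu_v(x) >= rho)
    return cd1 and cd2


domain, rho, eps = range(0, 101), 0.75, 0.35
low = trapezoid(0, 0, 20, 40)
medium = trapezoid(30, 45, 55, 70)
high = trapezoid(60, 80, 100, 100)
very_high = trapezoid(70, 90, 100, 100)
extremely_low = trapezoid(0, 0, 5, 15)

# Premises: P is compatible with A and Q with B, taken separately.
premises = (compatible(low, very_high, rho, eps, domain)
            and compatible(medium, extremely_low, rho, eps, domain)
            and compatible(low, high, rho, eps, domain)
            and compatible(medium, very_high, rho, eps, domain))

# Case (2): U = A ∧ B, V = P ∨ Q.
case2 = compatible(conj(low, medium), disj(very_high, extremely_low), rho, eps, domain)
# Case (3): U = A ∨ B, V = P ∧ Q.
case3 = compatible(disj(low, medium), conj(high, very_high), rho, eps, domain)

print(premises, case2, case3)  # True True True
```

With these assumed shapes, “very high ∨ extremely low” passes the check against “low ∧ medium”, in line with the De Morgan-like pattern of case (2).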

Example: Given the concept “wage” and the standard form $F_3$, we can obtain:
extremely high ∨ really medium $\in C\_Neg^3_{0.75, 0.35}$(low), and very high ∨ extremely low $\in
C\_Neg^3_{0.75, 0.35}$(low ∧ medium).

Remark: Even if the results are restricted to nuance combinations associated with $C_i$,
they lead to generalizations of the classical De Morgan laws and the double negation law.

In order to obtain the extended intended meanings of “x is not A”, we modify as 
follows definition 3.4. We can distinguish two cases: 
(a): Any complex property belonging to $C\_Neg^t_{\rho,\varepsilon}(A)$ can be an intended meaning for
any value of x. So, we can put: $\forall x$, $C\_neg^t_{\rho,\varepsilon}(x, A)=C\_Neg^t_{\rho,\varepsilon}(A)$.

(b): If it is not the case, we can only accept for this x, as in the modified definition 3.4,
the significant combinations. Put: $C\_neg^t_{\rho,\varepsilon}(x, A)=\{P \in C\_Neg^t_{\rho,\varepsilon}(A) \mid \mu_P(x) \ge \varepsilon\}$.

Definition 4.3 The set of intended meanings of “x is not A” can be defined as
follows: (a): either, $\forall x$, $C\_neg^t_{\rho,\varepsilon}(x, A)=C\_Neg^t_{\rho,\varepsilon}(A)$, (b): or $C\_neg^t_{\rho,\varepsilon}(x, A)=
\{P \in C\_Neg^t_{\rho,\varepsilon}(A) \mid \mu_P(x) \ge \varepsilon\}$. Any $P \in C\_neg^t_{\rho,\varepsilon}(x, A)$ is called an extended intended
meaning for x of the linguistic negation $\rho$-compatible with A with tolerance threshold
$\varepsilon$, given the standard form $F_t$.

Remark: It is also possible to propose a choice strategy of an extended intended 
meaning of “x is not A” by working in C_Neg*p_e(A) and by choosing as a complexity 
the sum of the complexities of the components of A. 

In the following, we suppose that the rule: if “x is A” then “y is B” receives as a
translation $\mu_A(x) \rightarrow \mu_B(y)$¹⁰, where A and B are the fuzzy sets associated with A and B.
Then, the following result extends the classical property $u \rightarrow v = \neg v \rightarrow \neg u$ to the
extended linguistic negation.

Proposition 4.2 Let us suppose that the implication → and its associated T-norm T
satisfy the properties: $u \rightarrow v = 1$ iff $u \le v$; $T(u \rightarrow v, v \rightarrow w) \le u \rightarrow w$ (weak transitivity law);
$u \rightarrow v = \neg v \rightarrow \neg u$ (contraposition law). Moreover, we suppose that: $\forall x$, $C\_Neg^t_{\rho,\varepsilon}(A)=
C\_neg^t_{\rho,\varepsilon}(x, A)$ and $C\_Neg^{t'}_{\rho,\varepsilon}(B)=C\_neg^{t'}_{\rho,\varepsilon}(x, B)$.

Then, the extended linguistic negation possesses the following properties: 

(i) If there exist $Q \in C\_Neg^{t'}_{\rho,\varepsilon}(B)$ and $P \in C\_Neg^t_{\rho,\varepsilon}(A)$ such that $Q \subset \neg B$ and $\neg A \subset P$,
then: $\mu_A(x) \rightarrow \mu_B(y) \le \mu_Q(y) \rightarrow \mu_P(x)$. In other words, the rule if “x is A” then “y is B”
implies the rule if “y is not B” then “x is not A”.

(ii) If there exist $P \in C\_Neg^t_{\rho,\varepsilon}(A)$ and $Q \in C\_Neg^{t'}_{\rho,\varepsilon}(B)$ such that $P \subset \neg A$ and $\neg B \subset Q$,
then: $\mu_Q(y) \rightarrow \mu_P(x) \le \mu_A(x) \rightarrow \mu_B(y)$. In other words, the rule if “y is not B” then “x is
not A” implies the rule if “x is A” then “y is B”.

(iii) If there exist $P \in C\_Neg^t_{\rho,\varepsilon}(A)$ and $Q \in C\_Neg^{t'}_{\rho,\varepsilon}(B)$ such that $\neg A=P$ and $\neg B=Q$,
then $\mu_A(x) \rightarrow \mu_B(y)=\mu_Q(y) \rightarrow \mu_P(x)$. In this case, the rules if “y is not B” then “x is not
A”, and if “x is A” then “y is B” are equivalent.

Proof. It can be noted that Łukasiewicz’s implication: $u \rightarrow v = 1$ if $u \le v$, else $1-u+v$,
satisfies the previous properties. Let us suppose that “y is Q” is an intended meaning
of “y is not B” such that the associated fuzzy set satisfies $Q \subset \neg B$. So, we have:
$\mu_Q(y) \rightarrow \mu_{\neg B}(y)=1$. Then, transitivity gives us: $\mu_A(x) \rightarrow \mu_B(y)=\mu_{\neg B}(y) \rightarrow \mu_{\neg A}(x) \le
\mu_Q(y) \rightarrow \mu_{\neg A}(x)$. Let P be a fuzzy set such that: $\neg A \subset P$. Then, $\mu_{\neg A}(x) \rightarrow \mu_P(x)=1$. By
transitivity, we obtain the following result: $\mu_A(x) \rightarrow \mu_B(y) \le \mu_Q(y) \rightarrow \mu_P(x)$. We seek to
define P (or the corresponding property P) in terms of linguistic negation of B,
knowing that generally $\neg A$ is not an extended linguistic negation of A. But, such a
solution, satisfying $\neg A \subset P$, can exist among the extended linguistic negations of A. In
this case, P has the following form: $P=\bigvee_{l \in L} P^l$ with $\forall l \in L$, $P^l \in Neg^t_{\rho,\varepsilon}(A)$. Conversely,
let us suppose that “x is P” is an intended meaning of “x is not A” such that $P \subset \neg A$.



¹⁰ In order to avoid interpretation errors about the use of variables x and y, please refer to the
explanation presented in Footnote 3 (Section 2).
A similar analysis leads to $\mu_A(x) \rightarrow \mu_B(y) \ge \mu_Q(y) \rightarrow \mu_P(x)$ with Q, an extended linguistic
negation of B satisfying $\neg B \subset Q$, having the following form: $Q=\bigvee_{l \in L} Q^l$ with $\forall l \in L$,
$Q^l \in Neg^{t'}_{\rho,\varepsilon}(B)$. Finally, if $\neg A$ and $\neg B$ can be viewed as extended linguistic negations
of A and B, we obtain the classical result.
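The three properties required of the implication, and the inequality of case (i), can be checked exhaustively on a finite grid for Łukasiewicz’s implication. The sketch below assumes the Łukasiewicz T-norm $T(a, b)=\max(0, a+b-1)$ as the associated T-norm and the standard complement $1-u$ for ¬; exact rationals are used to avoid rounding effects. The grid and the function names are choices made for this illustration.

```python
from fractions import Fraction
from itertools import product


def implies(u, v):
    """Lukasiewicz implication: 1 if u <= v, else 1 - u + v."""
    return Fraction(1) if u <= v else 1 - u + v


def t_norm(a, b):
    """Lukasiewicz T-norm, assumed here as the T-norm associated with implies."""
    return max(Fraction(0), a + b - 1)


GRID = [Fraction(k, 10) for k in range(11)]  # 0, 1/10, ..., 1

# u -> v = 1 iff u <= v
residuation = all((implies(u, v) == 1) == (u <= v)
                  for u, v in product(GRID, repeat=2))

# Contraposition law with the complement 1 - u as negation.
contraposition = all(implies(u, v) == implies(1 - v, 1 - u)
                     for u, v in product(GRID, repeat=2))

# Weak transitivity: T(u -> v, v -> w) <= u -> w
weak_transitivity = all(t_norm(implies(u, v), implies(v, w)) <= implies(u, w)
                        for u, v, w in product(GRID, repeat=3))

# Case (i): if mu_Q <= 1 - mu_B and 1 - mu_A <= mu_P, then (A -> B) <= (Q -> P).
case_i = all(implies(a, b) <= implies(q, p)
             for a, b, q, p in product(GRID, repeat=4)
             if q <= 1 - b and 1 - a <= p)

print(residuation, contraposition, weak_transitivity, case_i)  # True True True True
```

A grid check of this kind does not replace the proof, but it makes the role of each hypothesis easy to probe: dropping the condition `q <= 1 - b` in `case_i`, for instance, makes the inequality fail.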

Example: Let us suppose that: P=rather small ∨ very small ∨ extremely small $\in
C\_Neg^t_{\rho,\varepsilon}$(medium ∨ tall), rather small ∨ very small ∨ extremely small $\subset \neg$(medium ∨
tall), Q=invisible=¬visible. The rule “If the man is not visible in the crowd, he is not
medium or tall” receives as translation “If the man is invisible in the crowd, he is
rather or very or extremely small”. By using property 4.2(ii), this rule implies the rule
“If the man is medium or tall, he is visible in the crowd”.

Let us examine the assumptions concerning P and Q in Proposition 4.2 (ii), that is
to say: $\neg B \subset Q$ and $P \subset \neg A$. The conclusion of the initial negative rule requires that P,
the linguistic negation of A, should be such that: $P \subset \neg A$. It is obvious that generally
many nuances of A fulfil this constraint. But the hypothesis of this rule requires that
Q, the negation of B, satisfies a strong condition, since the fuzzy set associated with
Q should contain $\neg B$ (i.e. $\neg B \subset Q$). So, if $\neg B$ is not included in the union of all
nuances belonging to the reference frame, we cannot apply this theorem. However,
note that particular discourse universes can satisfy this condition. More precisely, this
is the case where two or five basic properties are associated with each concept $C_i$. In
the first case, each property will be the negation of the other, like “visible” and
“invisible”. In the second case, the logical negation allows us to give pertinent
meanings to five basic properties $P_{ik}$ through a one-to-one correspondence with the
following totally ordered linguistic expressions: {less_than, at_most, ∅, at_least,
more_than} (cf. Figure 5). Indeed, the five basic properties $P_{ik}$ can be based upon A
and $\neg A$, by choosing $A=P_{i3}$ and $\neg A=P_{i1} \vee P_{i5}$.




Fig. 5. Particular Basic Properties 



Remark: It appears clearly that the extended modelisation of linguistic negation 
within a fuzzy context proposed here leads to a generalization of the logical negation 
defined within the boolean context. 



5 Conclusion 



We have presented an extension of a concept of linguistic negation, which can now be
viewed as a generalization of the logical one, in satisfactory accordance with
linguistic analysis. This new model improves the abilities of the previous ones, in
that the extended linguistic negation possesses a more powerful set of properties. In
particular, the user can now deny complex combinations of nuanced information.
Abstract. This paper presents a system of argumentation which captures
the kind of reasoning possible in qualitative probabilistic networks,
including reasoning about expected utilities of actions and the propagation
of synergies between actions. In these latter regards it is an extension
of our previous work on systems of argumentation which reason
with qualitative probabilities.



1 Introduction 

In the last few years there have been a number of attempts to build systems for
reasoning under uncertainty that are of a qualitative nature, that is, they use
qualitative rather than numerical values, dealing with concepts such as increases
in belief and the relative magnitude of values. Three main classes of system can
be distinguished: systems of abstraction, infinitesimal systems, and systems of
argumentation. In systems of abstraction, the focus is mainly on modelling how
the probabilities of hypotheses change when evidence is obtained. Such systems
provide an abstract version of probability theory, known as qualitative probabilistic
networks (QPNs) [25], which is sufficient for planning [25], explanation
[5] and prediction [18] tasks. Infinitesimal systems deal with beliefs that are
very nearly 1 or 0, providing formalisms that handle order of magnitude probabilities.
Such systems may be used for diagnosis [4] and have been extended
with infinitesimal utilities to give complete decision theories [21,26]. Systems
of argumentation are based on the idea of constructing logical arguments for
and against formulae. Such systems have been applied to problems such as
diagnosis, protocol management and risk assessment [11], as well as handling
inconsistent information [1], and providing a framework for default reasoning
[10,16].

In a previous paper [17], we provided a hybridisation of the argumentation
and abstraction approaches by introducing a system called the qualitative probabilistic
reasoner (QPR) which constructed arguments about how probabilities
change. In this paper we extend the kind of reasoning possible using QPR to
deal with information about changes in utilities, thus providing a qualitative
utility reasoner QUR which provides an abstraction of classical decision making
rather than just of probability theory and so captures the kind of reasoning
possible in QPNs.
2 The logical language

This section introduces the language used by our system. We build on the language
of QPR by introducing notions of utility, but to save space here we only
deal with non-categorical changes in value, simplify the language by not dealing
with logical conjunction, restrict ourselves to causally directed reasoning,
and cut the discussion of those features drawn from QPR. A fuller account is
contained in [19].



2.1 Basic concepts 

We start with a set of atomic propositions $\mathcal{L}$, which includes the symbol $V$. We
also have a set of connectives {-■, l+), and the following set of rules 

for building the well-formed formulae (wjfs) of the language. 

1. If $l \in \mathcal{L}$ then $l$ is a well-formed simple formula (swff).

2. If $l$ is an swff, then $\neg l$ is an swff.

3. If $l$ and $m$ are swffs, then $l \rightarrow m$ is a well-formed implicational formula
(iwff).

4. If $l$ is an swff, then $l \rightarrow V$ is a well-formed value formula (vwff).

5. If I, m and n are swffs, then / 1+) to n and / 1+) to F are well-formed 
synergistic formulae (ywjfs). 

We denote the set of all swffs which can be defined using $\mathcal{L}$ by $\mathcal{S}_\mathcal{L}$, while $\mathcal{I}_\mathcal{L}$, $\mathcal{Y}_\mathcal{L}$
and $\mathcal{V}_\mathcal{L}$ denote the corresponding sets of iwffs, ywffs and vwffs respectively.
The set of all wffs which can be defined using $\mathcal{L}$ is $\mathcal{W} = \mathcal{S}_\mathcal{L} \cup \mathcal{I}_\mathcal{L} \cup \mathcal{Y}_\mathcal{L} \cup \mathcal{V}_\mathcal{L}$.
$\mathcal{W}$ may then be used to build up a database $\Delta$ where every item $d \in \Delta$ is a
triple $(i : l : s)$ in which $i$ is a token uniquely identifying the database item
(for convenience we will use the letter ‘i’ as an anonymous identifier), $l \in \mathcal{W}$,
and $s$ gives information about the probability of $l$. In particular we take triples
$(i : l : \uparrow)$ to denote the fact that Pr($l$) increases (due to some piece of evidence),
and similar triples $(i : l : \downarrow)$ to denote the fact that Pr($l$) decreases. Triples
$(i : l : \leftrightarrow)$ denote the fact that Pr($l$) is known to neither increase nor decrease,
and triples $(i : l : \updownarrow)$ denote that we don’t know whether Pr($l$) increases or decreases.
It should be noted that the triple $(i : l : \uparrow)$ indicates that Pr($l$) either goes up
or does not change (this inclusive interpretation of the notion of “increase” is
taken from QPNs), and of course a similar proviso applies to $(i : l : \downarrow)$.



2.2 Non-material implication 

Now, $\rightarrow$ does not represent material implication but a connection between the
probabilities of antecedent and consequent. We take an iwff, which we will also
call an “implication”, to denote that the antecedent of the iwff has a probabilistic
influence on the consequent. Thus we are not concerned with the probability of
the iwff, but what the wff says about the probabilities of its antecedent and
consequent. More precisely we take the triple $(i : a \rightarrow c : +)$ to denote the fact
that:

$$\Pr(c \mid a, X) \geq \Pr(c \mid \neg a, X) \qquad (1)$$

for all $X \in \{x, \neg x\}$ for which there is a triple $(i : X \rightarrow c : s)$ (where $s$ is any
sign) or $(i : c \rightarrow X : s)$. The effect of the $X$ in this inequality is to ensure
that the restriction holds whatever is known about formulae other than $c$ and
$a$; whatever the probabilities of $a$ and $c$, the constraint on the conditional
probabilities holds. It is possible to think of this as meaning that there is a constraint
on the probability distribution over the formulae $c$ and $a$ such that an increase
in the probability of $a$ entails an increase in the probability of $c$. The triples
$(i : a \rightarrow c : -)$ and $(i : a \rightarrow c : 0)$ denote that (1) holds with $\geq$ replaced by
$\leq$ and $=$ respectively. We also have implications such as $(i : a \rightarrow c : ?)$ which
denotes the fact that the relationship between $\Pr(c \mid a, X)$ and $\Pr(c \mid \neg a, X)$ is not
known, so that if the probability of $a$ increases it is not possible to say how the
probability of $c$ will change.

With this interpretation, implications correspond to qualitative influences in
QPNs, and, as is the case in all probabilistic networks [20], are causally directed
in the sense that the antecedent is a cause of the consequent. This restriction is
necessary to ensure that QUIZ is sound, for the reasons discussed in [17].

2.3 Values 

The proposition $V$ denotes the same thing as the value node in an influence
diagram [13], that is, the utility of the decision maker. It can be used, just
like any other swff, to form triples, and these denote a change in utility. Thus
$(i : V : \uparrow)$ means that utility increases. QUIZ also makes use of triples based on
vwffs, and a triple $(i : a \rightarrow V : +)$ is taken to mean:

$$U(a, X) \geq U(\neg a, X) \qquad (2)$$

where, as before, $X$ ranges across all other propositions which affect $V$, in this
case all other propositions which are antecedents of vwffs. The meaning of the
triple, as given by (2), is that $a$ positively influences utility. Similar triples with
sign $-$ and $0$ denote that (2) holds with $\geq$ replaced by $\leq$ and $=$ respectively, and
we use the sign $?$ to denote situations in which the relationship is not known.

2.4 Synergy 

Being able to handle synergy relations is an important part of any qualitative
probabilistic system. A detailed discussion of synergy is beyond the scope of
this paper^1, but, informally, there is synergy between two variables with respect
to a third if a change in the value of one of the first two has an effect on the
relationship between the second and the third. Thus, A and B have a synergistic
relationship with respect to C, if an increase in the probability of A changes the
strength of the probabilistic influence between B and C. In our system synergies
are represented by formulae such as $a \boxplus b \rightarrow c$ which represents the synergy
which exists between $a$ and $b$ with respect to $c$. Such synergistic formulae form
the basis of triples such as $(i : a \boxplus b \rightarrow c : +)$ in just the same way as simple
and implicational formulae do, but with yet another denotation. In particular,
$(i : a \boxplus b \rightarrow c : +)$ denotes the fact that:

$$\Pr(c \mid a, b, X) + \Pr(c \mid \neg a, \neg b, X) \geq \Pr(c \mid \neg a, b, X) + \Pr(c \mid a, \neg b, X) \qquad (3)$$

where, as ever, $X$ ranges across all other formulae such that there are triples
$(i : X \rightarrow c : s)$ or $(i : c \rightarrow X : s)$. Similarly, $(i : a \boxplus b \rightarrow c : -)$ and
$(i : a \boxplus b \rightarrow c : 0)$ denote that (3) holds with $\geq$ replaced by $\leq$ and $=$
respectively. As with the case of implications, synergies have sign $?$ when the
relationship is not known. These synergy expressions are precisely the conditions
necessary and sufficient [18,25] to capture the fact that a change in $\Pr(a)$ has an
effect on the influence of $\Pr(b)$ on $\Pr(c)$. It is perfectly possible to have synergies
with respect to the value node, represented by triples such as
$(i : a \boxplus b \rightarrow V : +)$. This latter denotes the fact that:

$$U(a, b, X) + U(\neg a, \neg b, X) \geq U(\neg a, b, X) + U(a, \neg b, X) \qquad (4)$$

where $X$ is as before. Similarly, $(i : a \boxplus b \rightarrow V : -)$ and $(i : a \boxplus b \rightarrow V : 0)$
denote that (4) holds with $\geq$ replaced by $\leq$ and $=$ respectively. Note that all
synergies are symmetrical, and that the synergies we deal with here are known as
additive synergies. In contrast, QVTZ [17] deals only with product synergies.

^1 See [5, 6, 25] for detail on the subject.
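
The additive synergy condition (3) can be checked numerically for a single fixed context X. The following is a minimal sketch under stated assumptions: the function name, the dictionary encoding of the four conditional probabilities, and the numerical tolerance are all hypothetical choices made for illustration, and the function reports the most specific sign that holds rather than the inclusive reading of "+" and "-".

```python
import math

def additive_synergy_sign(pr_c, tol=1e-12):
    """Classify the additive synergy of a and b on c in one fixed context X.

    pr_c maps (a, b) truth-value pairs to Pr(c | a, b, X).
    Returns the most specific of '+', '-' and '0'.
    """
    lhs = pr_c[(True, True)] + pr_c[(False, False)]
    rhs = pr_c[(False, True)] + pr_c[(True, False)]
    if math.isclose(lhs, rhs, abs_tol=tol):
        return "0"
    return "+" if lhs > rhs else "-"

# Hypothetical conditional probability tables, invented for illustration only.
reinforcing = {(True, True): 0.9, (True, False): 0.5,
               (False, True): 0.4, (False, False): 0.3}
additive = {(True, True): 0.7, (True, False): 0.5,
            (False, True): 0.4, (False, False): 0.2}

print(additive_synergy_sign(reinforcing))
print(additive_synergy_sign(additive))
```

In the second table the effects of a and b simply add, so the two sides of (3) are equal and the synergy sign is 0.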

3 The proof theory 

The previous section introduced a language for describing probabilistic influences 
between formulae. For this to be useful, we need to give a mechanism for taking 
sentences in that language and using them to derive new sentences. 

3.1 Arguments 

We derive new sentences using the consequence relation $\vdash_{QU}$ which is defined
in Figure 1. The definition is in terms of Gentzen-style proof rules where the
antecedents are written above the line and the consequence is written below.
The consequence relation operates on a database consisting of the kind of triples
introduced in the previous section and derives arguments about formulae from
them. There are two types of argument^2:

Definition 1. An influence argument for a well-formed formula $p$ from a
database $\Delta$ is a triple $S(p, G, s)$ such that $\Delta \vdash_{QU} S(p, G, s)$.

The sign $s$ of an influence argument denotes something about the change in the
probability of $p$ which can be inferred given the grounds $G$, the elements of the
database used in the derivation of $p$.

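^2 The use of S and Y to denote the different types is taken from [25].

S-rules

Ax1: $\dfrac{(i : St : Sg) \in \Delta, \quad St \in S_{\mathcal{L}} \cup I_{\mathcal{L}} \cup V_{\mathcal{L}}}{\Delta \vdash_{QU} S(St, \{i\}, Sg)}$

$\neg$-E: $\dfrac{\Delta \vdash_{QU} S(\neg St, G, Sg)}{\Delta \vdash_{QU} S(St, G, \mathrm{neg}(Sg))}$

$\neg$-I: $\dfrac{\Delta \vdash_{QU} S(St, G, Sg)}{\Delta \vdash_{QU} S(\neg St, G, \mathrm{neg}(Sg))}$

$\rightarrow$-E: $\dfrac{\Delta \vdash_{QU} S(St, G, Sg) \quad \Delta \vdash_{QU} S(St \rightarrow St', G', Sg')}{\Delta \vdash_{QU} S(St', G \cup G', \mathrm{imp}_{\mathrm{elim}}(Sg, Sg'))}$

$\rightarrow$-E: $\dfrac{\Delta \vdash_{QU} S(St, G, Sg) \quad \Delta \vdash_{QU} S(St \rightarrow V, G', Sg')}{\Delta \vdash_{QU} S(V, G \cup G', \mathrm{val}_{\mathrm{prop}}(Sg, Sg'))}$

Y-rules

Ax2: $\dfrac{(i : St \boxplus St' \rightarrow St'' : Sg) \in \Delta}{\Delta \vdash_{QU} Y((St'', St, St'), \{i\}, Sg)}$

Ax3: $\dfrac{(i : St \boxplus St' \rightarrow V : Sg) \in \Delta}{\Delta \vdash_{QU} Y((V, St, St'), \{i\}, Sg)}$

Y-I1: $\dfrac{\Delta \vdash_{QU} S(St \rightarrow St', G, Sg) \quad \Delta \vdash_{QU} Y((St, St'', St'''), G', Sg')}{\Delta \vdash_{QU} Y((St', St'', St'''), G \cup G', \mathrm{syn}_{\mathrm{prop}}(Sg, Sg'))}$

Y-I2: $\dfrac{\Delta \vdash_{QU} S(St \rightarrow St', G, Sg) \quad \Delta \vdash_{QU} Y((St'', St', St'''), G', Sg')}{\Delta \vdash_{QU} Y((St'', St, St'''), G \cup G', \mathrm{syn}_{\mathrm{prop}}(Sg, Sg'))}$

Y-I3: $\dfrac{\Delta \vdash_{QU} S(St \rightarrow St', G, Sg) \quad \Delta \vdash_{QU} Y((St'', St''', St'), G', Sg')}{\Delta \vdash_{QU} Y((St'', St, St'''), G \cup G', \mathrm{syn}_{\mathrm{prop}}(Sg, Sg'))}$

Fig. 1. The consequence relation $\vdash_{QU}$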



Definition 2. A synergy argument for a well-formed formula $p$ from a database
$\Delta$ is a triple $Y((p, q, r), G, s)$ such that $\Delta \vdash_{QU} Y((p, q, r), G, s)$.

Such an argument indicates that $q$ and $r$ have a synergistic effect on $p$. The
sign gives the synergy of $q$ on the relation between $r$ and $p$, or, equivalently, the
synergy of $r$ on the relation between $q$ and $p$.

To see how the idea of an argument fits in with the proof rules in Figure 1,
consider the rules ‘Ax1’ and ‘$\rightarrow$-E’. The first says that from a triple $(i : l : s)$
it is possible to build an argument for $l$ which has sign $s$ and a set of grounds
$\{i\}$ (the grounds thus identify which elements from the database are used in
the derivation). The rule is thus a kind of bootstrap mechanism to allow the
elements of the database to be turned into arguments to which other rules can
be applied. The second rule can be thought of as analogous to modus
ponens. From an argument for $a$ and an argument for $a \rightarrow c$ it is possible to
build an argument for $c$ once the necessary book-keeping with grounds and signs
has been carried out. The proof procedure used here has an important difference
from other similar logical proof systems which stems from the fact that QUIZ is
dealing with probability values (albeit changes in probability) rather than just
truth and falsity as is the case in classical logic. In logic, once there is a valid
proof for a formula, the formula is known to be true. Here we may have several
arguments which suggest different things about the probability of a formula and
it is necessary to establish all the arguments and then combine them.

Fig. 2. The functions neg, $\mathrm{imp}_{\mathrm{elim}}$ and $\mathrm{val}_{\mathrm{prop}}$.

3.2 Combination functions 

In order to apply the proof rules to build arguments, it is necessary to supply the
functions used in Figure 1 to combine signs. Broadly speaking, all these functions
are exactly those introduced by Wellman [25] for the analogous operations in
QPNs^3. The rules for handling negation are applicable only to swffs and permit
negation to be either introduced or eliminated by altering the sign, for example
allowing $(i : \neg a : \uparrow)$ to be rewritten as $(i : a : \downarrow)$. This leads to the definition of
neg:

Definition 3. The function neg : $Sg \in \{\uparrow, \downarrow, \leftrightarrow, \updownarrow\} \mapsto Sg' \in \{\uparrow, \downarrow, \leftrightarrow, \updownarrow\}$ is
specified in Figure 2.

To deal with implication we need the function $\mathrm{imp}_{\mathrm{elim}}$ to establish the sign of
formulae generated by the rule of inference $\rightarrow$-E. This means that $\mathrm{imp}_{\mathrm{elim}}$ is used
to combine the change in probability of a formula $a$, say, with the constraint
that the probability of $a$ imposes upon the probability of another formula $c$.

Definition 4. The function $\mathrm{imp}_{\mathrm{elim}}$ : $Sg \in \{\uparrow, \downarrow, \leftrightarrow, \updownarrow\} \times Sg' \in \{+, 0, -, ?\} \mapsto
Sg'' \in \{\uparrow, \downarrow, \leftrightarrow, \updownarrow\}$ is specified in Figure 2.

We also need the function $\mathrm{val}_{\mathrm{prop}}$ which makes it possible to determine the
changes in utility.

Definition 5. The function $\mathrm{val}_{\mathrm{prop}}$ : $Sg \in \{\uparrow, \downarrow, \leftrightarrow, \updownarrow\} \times Sg' \in \{+, 0, -, ?\} \mapsto
Sg'' \in \{\uparrow, \downarrow, \leftrightarrow, \updownarrow\}$ is specified in Figure 2.

This function is virtually identical to $\mathrm{imp}_{\mathrm{elim}}$, differing only in that it combines a
change in probability with a utility to give a change in expected utility, whereas
$\mathrm{imp}_{\mathrm{elim}}$ derives a change in probability from a change in probability and a
relationship between probabilities. We also need the function $\mathrm{syn}_{\mathrm{prop}}$ in order to be
able to reason with synergies.

Definition 6. The function $\mathrm{syn}_{\mathrm{prop}}$ : $Sg \in \{+, 0, -, ?\} \times Sg' \in \{+, 0, -, ?\} \mapsto
Sg'' \in \{+, 0, -, ?\}$ is specified in Figure 3.

These functions are sufficient to apply $\vdash_{QU}$ to build both influence and synergy
arguments.

Fig. 3. Synergy propagation $\mathrm{syn}_{\mathrm{prop}}$ and flattening functions $\mathrm{flat}_S$ and $\mathrm{flat}_Y$.

^3 The reason our notation differs is to allow our system to be extended to handle
categorical information exactly as in [17].
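
Since the text states that these functions are broadly those used for the analogous operations in QPNs, the sign combinations can be sketched as lookup-style functions. The tables below are an assumption based on the usual qualitative sign product, not a transcription of Figures 2 and 3, and the Python names and string encodings of the signs are hypothetical.

```python
UP, DOWN, NONE, UNKNOWN = "↑", "↓", "↔", "↕"

def neg(sign):
    """Sign of a formula's negation: increases and decreases swap."""
    return {UP: DOWN, DOWN: UP, NONE: NONE, UNKNOWN: UNKNOWN}[sign]

def imp_elim(change, influence):
    """Combine a change in Pr(antecedent) with an influence in {'+', '0', '-', '?'}."""
    if change == NONE or influence == "0":
        return NONE          # nothing changes, or no influence at all
    if change == UNKNOWN or influence == "?":
        return UNKNOWN       # direction cannot be determined
    return change if influence == "+" else neg(change)

# Assumed to behave like imp_elim at the level of signs
# (the text calls the two functions "virtually identical").
val_prop = imp_elim

def syn_prop(first, second):
    """Sign product over {'+', '0', '-', '?'} used to propagate synergies."""
    if first == "0" or second == "0":
        return "0"
    if "?" in (first, second):
        return "?"
    return "+" if first == second else "-"

print(imp_elim(UP, "-"), val_prop(DOWN, "-"), syn_prop("+", "-"))
```

Keeping the change signs (arrows) distinct from the influence signs mirrors the paper's two sign alphabets.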



3.3 Flattening 

In general it is possible to build several arguments for a single proposition. To
get firm conclusions we need to flatten all the arguments for a proposition to
get a single sign which tells us the combined change in the probability of that
proposition. We can describe this in terms of a function $\mathrm{Flat}_S(\cdot)$ which maps
from a set of influence arguments $A_S$ for a proposition $St$ built from a particular
database $\Delta$ to the pair of that proposition and some overall measure of validity:

$$\mathrm{Flat}_S : A_S \mapsto S(St, v)$$

where $A_S$ is the set of all influence arguments which are concerned with $St$, that
is:

$$A_S = \{S(St, G_i, Sg_i) \mid \Delta \vdash_{QU} S(St, G_i, Sg_i)\}$$

and $v$ is the result of a suitable combination of the $Sg_i$ that takes into account
the structure of the arguments. Since in the precise case we are considering here,
the structure is unimportant (though in very similar cases it must be taken into
consideration [17]) we can ignore the grounds and define $v$ as:

$$v = \mathrm{flat}_S(\{Sg_i \mid S(St, G_i, Sg_i) \in A_S\})$$

where $\mathrm{flat}_S$ is as defined in Figure 3. We can formalise a similar notion for synergy
arguments in terms of a function $\mathrm{Flat}_Y(\cdot)$ which maps from a set of synergy
arguments $A_Y$ for a proposition $St$ to the pair of that synergistic relationship
and some overall measure of validity:

$$\mathrm{Flat}_Y : A_Y \mapsto Y((St, St', St''), v)$$

where $A_Y$ is the set of all synergy arguments which give the synergistic effect
of $St'$ and $St''$ on $St$:

$$A_Y = \{Y((St, St', St''), G_i, Sg_i) \mid \Delta \vdash_{QU} Y((St, St', St''), G_i, Sg_i) \text{ or } \Delta \vdash_{QU} Y((St, St'', St'), G_i, Sg_i)\}$$

and $v$ is defined by:

$$v = \mathrm{flat}_Y(\{Sg_i \mid Y((St, St', St''), G_i, Sg_i) \in A_Y\})$$

where $\mathrm{flat}_Y$ is given in Figure 3.
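
Flattening can be sketched as a fold over the signs of all arguments for the same proposition. The combination used below is the usual qualitative sign sum, which is an assumption here and not a transcription of Figure 3; the function name and the choice of the neutral sign for an empty set of arguments are likewise hypothetical.

```python
def flatten(signs, neutral, unknown):
    """Combine the signs of several arguments into a single overall sign.

    Agreeing signs are kept, the neutral sign is ignored, and any
    disagreement (or an unknown sign) yields the unknown sign.
    """
    result = neutral
    for sign in signs:
        if result == neutral:
            result = sign
        elif sign != neutral and sign != result:
            result = unknown
    return result

def flat_s(signs):
    """Flatten influence-argument signs drawn from ↑, ↓, ↔, ↕."""
    return flatten(signs, neutral="↔", unknown="↕")

def flat_y(signs):
    """Flatten synergy-argument signs drawn from +, -, 0, ?."""
    return flatten(signs, neutral="0", unknown="?")

print(flat_s(["↑", "↓"]))   # conflicting arguments
print(flat_y(["-", "-"]))   # agreeing arguments
```

Because the grounds are ignored, as the text allows in this case, the result depends only on the set of signs and not on their order.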

4 Soundness and Completeness 

We can show that QUIZ is sound with respect to decision theory, and determine
bounds on what it can deduce. First consider soundness^4:

Theorem 7. The construction and flattening of influence and synergy arguments
in QUIZ using $\vdash_{QU}$ is sound with respect to decision theory.

To prove completeness, one first needs to establish a proof procedure. The pro- 
cedure for computing the effect on some formula p is: 

1. Add a triple $(i : q : s)$ for every formula $q$ whose change in probability is
known.

2. Build $A_S$, the set of all influence arguments for $p$.

3. Flatten this set to $S(p, v_S)$.

4. Build $A_Y$, the set of all synergy arguments for $p$.

5. Flatten this set to $Y(p, v_Y)$.

This naturally backward chaining procedure can obviously be extended to com- 
pute the effect on a whole set of propositions. Now, we also need to define the 
sense in which we consider the system to be complete. 

Definition 8. A well-formed formula $p$ is said to be a cause of a well-formed
formula $q$ if and only if it is possible to identify an ordered set of iwffs
$\{p \rightarrow a_1, a_1 \rightarrow a_2, \ldots, a_n \rightarrow q\}$. If $q$ is the value proposition $V$, the final
member of the set is $a_n \rightarrow V$.

In other words $p$ is a cause of $q$ if it is possible to build up a trail of (causally
directed) implications which link $p$ to $q$. We have a similar notion for synergies:

Definition 9. A well-formed formula $p$ is said to be a synergistic cause of a
well-formed formula $q$ if there is a ywff $a \boxplus b \rightarrow c$ such that $p$ is a cause of either
$a$ or $b$ and $c$ is a cause of $q$. If $q$ is the value proposition $V$, then the ywff in
question is of the form $a \boxplus b \rightarrow V$.

^4 All the proofs in this section are straightforward but lengthy, and so have been
omitted to save space. They may be found in [19] and are simple extensions of those
in [17].

Definition 10. A well-formed formula $q$ is said to be an effect (respectively a
synergistic effect) of a well-formed formula $p$ if and only if $p$ is a cause
(respectively a synergistic cause) of $q$.

Definition 11. The construction and flattening of arguments is said to be 
causally complete in some system of qualitative utility with respect to some for- 
mula p if it is possible to use that system to compute the changes in probability 
of all the effects of p. 

Given these definitions we can prove that QUIZ is complete in the following
sense:

Theorem 12. The construction and flattening of influence arguments in QUIZ
using $\vdash_{QU}$ is causally complete with respect to any simple well-formed formula.

We also need to deal with synergy arguments. For them we need the following 
notion of completeness: 

Definition 13. The construction and flattening of arguments is said to be syn- 
ergistically causally complete in some system of qualitative utility with respect 
to some formula p if it is possible to use that system to compute the synergies 
involving p and all its synergistic effects. 

Given this we can show that: 

Theorem 14. The construction and flattening of synergy arguments in QUIZ
using $\vdash_{QU}$ is synergistically causally complete with respect to any simple
well-formed formula.

Note that completeness is defined only in terms of swffs. This restriction is 
considered in detail in [19]. 



5 Example 

This section presents a short example of the kind of reasoning possible in QUIZ.
Since the example is one used in [25], it also helps to informally demonstrate the
fact that QUIZ captures the kind of reasoning possible in QPNs.

The example concerns the decisions made about digitalis therapy, and comes
initially from [22]. An increased dosage of digitalis (dig) has a negative effect
on conduction (con) (r1) and a positive effect on automaticity (aut) (r2). A
negative effect on conduction is the aim of the therapy since the conduction has
a positive effect on heart rate (hr) (r3) and a reduction in heart rate is what is
required (r4). Automaticity has a positive effect on ventricular fibrillation (vf)
(r5), a life threatening state (r6). High calcium levels (Ca) also have a positive
effect on automaticity (r7). Increasing the digitalis dose makes automaticity
more sensitive to calcium level (r8), and an increased heart rate means that
ventricular fibrillation has a more severe effect on the patient’s well-being. This
information can be expressed as:

$(r1 : dig \rightarrow con : -)$    $(r2 : dig \rightarrow aut : +)$    $(r3 : con \rightarrow hr : +)$

$(r4 : hr \rightarrow V : -)$    $(r5 : aut \rightarrow vf : +)$    $(r6 : vf \rightarrow V : -)$

$(r7 : Ca \rightarrow aut : +)$    $(r8 : dig \boxplus Ca \rightarrow aut : +)$

Adding $(f1 : dig : \uparrow)$, indicating increased digitalis dosage, to this database, we
can build the influence arguments:

$S(V, \{r1, r3, r4\}, \uparrow)$

$S(V, \{r2, r5, r6\}, \downarrow)$

These indicate, respectively, that there are reasons to think both that overall
utility will increase and that it will decrease. These flatten to give $S(V, \updownarrow)$,
indicating, exactly as with the equivalent QPN, that there is no conclusive argument.
We can also build two synergy arguments connecting dig and Ca with $V$:

$Y((V, Ca, dig), \{r8, r5, r6\}, -)$

$Y((V, dig, Ca), \{r9, r5, r7, r3, r1\}, -)$

These flatten to give $Y((V, dig, Ca), -)$, indicating that digitalis dosage and
calcium level have a negative synergistic effect on overall utility. Thus increasing
digitalis dosage reduces the effect that an increase in calcium level has on utility.
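
The two influence arguments of this example can be reproduced with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: it chains forwards from dig rather than backwards from V, it treats value formulae like implications at the level of signs, it assumes the usual qualitative sign product and sign sum for the combination functions, and all Python names are hypothetical. Only the influences r1 to r7 described in the prose are encoded.

```python
UP, DOWN, NONE, UNKNOWN = "↑", "↓", "↔", "↕"

def imp_elim(change, influence):
    """Propagate a change in probability across an influence ('+', '0', '-', '?')."""
    if change == NONE or influence == "0":
        return NONE
    if change == UNKNOWN or influence == "?":
        return UNKNOWN
    if influence == "+":
        return change
    return DOWN if change == UP else UP

def flat_s(signs):
    """Flatten the signs of several influence arguments into one sign."""
    result = NONE
    for sign in signs:
        if result == NONE:
            result = sign
        elif sign != NONE and sign != result:
            result = UNKNOWN
    return result

# (identifier, antecedent, consequent, sign), following the prose description.
RULES = [
    ("r1", "dig", "con", "-"), ("r2", "dig", "aut", "+"),
    ("r3", "con", "hr", "+"), ("r4", "hr", "V", "-"),
    ("r5", "aut", "vf", "+"), ("r6", "vf", "V", "-"),
    ("r7", "Ca", "aut", "+"),
]

def influence_arguments(start, change, target):
    """Return (grounds, sign) for every causal chain from start to target."""
    found = []

    def walk(node, sign, grounds):
        if node == target:
            found.append((grounds, sign))
            return
        for ident, antecedent, consequent, influence in RULES:
            if antecedent == node:
                walk(consequent, imp_elim(sign, influence), grounds + [ident])

    walk(start, change, [])
    return found

arguments = influence_arguments("dig", UP, "V")
for grounds, sign in arguments:
    print("S(V,", grounds, ",", sign, ")")
print("flattened:", flat_s(sign for _, sign in arguments))
```

The recursion terminates because the rule set is acyclic, which matches the causally directed reading of implications.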

6 Discussion 

The system introduced in this paper has its roots in Wellman’s QPNs [25], 
the first attempt to build a qualitative decision theory, and draws its notion of 
“qualitative” from QPNs. This is a notion close to that in qualitative physics [14] 
where the basic abstraction is that which distinguishes between positive, negative 
and zero quantities and the derivatives of those quantities. The main focus in 
both QPNs and QUIZ is on the way in which values change with evidence. 
These two factors, the extreme abstraction and the concentration on change,
distinguish both QUIZ and QPNs from other qualitative systems.

As mentioned in the introduction, there have been a number of attempts to
devise qualitative decision theories where “qualitative” is taken to mean some
form of relative order of magnitude based upon infinitesimal quantities. The
first such effort was that of Pearl [21] which abstracted utility values in this
way (earlier work, such as that of Darwiche [3] and Goldszmidt [12], had dealt
with probabilities of this form). In doing this, Pearl provided an order of
magnitude version of classical decision theory. This was then extended by Tan
[23, 24] to deal with conditional preferences, so that it is possible to base decisions
on statements like “if β is preferred to α”. Around the same time Wilson [26]
provided an alternative way to formulate Pearl’s original qualitative version of
classical decision theory, and more recently Lehmann [15] has made a similar
proposal. The strand of this work which is most similar to ours is that of Bonet
and Geffner [2], who also keep track of the reasons behind the decision, in terms
of the information used to reach it.

A different notion of “qualitative” is that investigated by Dubois,
Prade and colleagues [7-9]. Their system has a possibilistic rather than a
probabilistic semantics and is qualitative in the sense that only the ordinal rank of
quantities is important. It should be noted, however, that the values they use
are not infinitesimal (though one could build an infinitesimal version of their
theory), and so can be considered more expressive than those of Pearl et al. It
should also be noted that while, as described here, our system has a probabilistic
semantics, we can give it alternative semantics, as discussed in [19].



7 Summary 

This paper has extended our previous work on proof theoretic approaches to 
qualitative probabilistic reasoning [17] in two important ways. First this paper 
has extended it to deal with statements of utility, making it possible to reason 
about changes in expected utility as well as about changes in probabilities. This 
is an important step in developing a qualitative decision theory. Second, this 
paper has dealt with the concept of additive synergy, which is important in 
determining dominating decision options [25].



Abstract. We propose a new approach to functional regression based
on fuzzy evidence theory. This method uses a training set for computing 
a fuzzy belief structure which quantifies different types of uncertainties, 
such as nonspecificity, conflict, or low density of input data. The method 
can cope with a very large class of training data, such as numbers, in- 
tervals, fuzzy numbers, and, more generally, fuzzy belief structures. In 
order to limit calculations and improve output readability, we propose a 
belief structure simplification method, based on similarity between fuzzy 
sets and significance of these sets. The proposed model can provide pre- 
dictions in several different forms, such as numerical, probabilistic, fuzzy 
or as a fuzzy belief structure. To validate the model, we propose two 
simulations and compare the results with classical or fuzzy regression 
methods. 



1 Introduction 

The problem of modeling a process from observed data has been an important
topic in several disciplines ranging from nonlinear regression to machine
learning and system identification. In classical regression analysis, this can often
be seen as a problem of global or local functional approximation. Examples of
global functional estimators are generalised linear models, smoothing splines [9]
and multi-layer perceptrons. In local models, including Generalised Radial Basis
Functions, kernel regressors, regression trees, or memory-based methods such as
the “lazy” method [2], the model description is divided into several parts of the
input space.

Although statistical regression analysis is one of the most widely used
techniques to model relationships between variables, it is not well adapted to all
kinds of uncertainties which can be encountered in real applications. Evidence
theory [3, 14], fuzzy set theory, and fuzzy evidence theory [15,21] each highlight
different forms of uncertainties [12], and, as such, provide valuable alternatives
to classical probability theory for data analysis. Several approaches such as fuzzy
or interval linear regression [7], fuzzy [20, 17] and neuro-fuzzy inference systems
[11] aim to model vague or fuzzy phenomena in functional estimation. A neural
network approach based on evidence theory has also been proposed as a
functional approximation method [6].

Here, we propose a new regression method, based on fuzzy evidence theory,
which can be applied to a large class of output data, such as intervals or fuzzy
numbers. A more general case taking into account probabilistic uncertainty of
target values is also treated. The formalism of belief structures is used to assess
and quantify uncertainty in the output data. A procedure is proposed for
computing a belief structure from a training set, adapting a general principle formerly
introduced in a discrimination context [5]. The concept of pignistic probabilities
makes it possible to transform degrees of belief into probabilistic confidence values
that may be attached to predictions. We show how our model can integrate different
kinds of uncertainties such as imprecision, discord and ignorance.

To prevent an exponential increase of focal elements during the combination
process, a simplification method is also proposed, based on two criteria: similarity
between fuzzy sets and weight of these sets. The similar fuzzy sets are aggregated
and the “least representative” fuzzy sets are eliminated.

The paper is organized as follows. In Section 2, we briefly recall the basics of
evidence theory and its fuzzy extension. In Section 3, we describe the proposed
model and a variant using prototypes. The simplification algorithm is presented
in Section 4. Finally, we give some numerical results in Section 5 and conclude
in Section 6.

2 Fuzzy Belief Structures 

In this section, we briefly introduce some of the main ideas behind the
Dempster-Shafer theory of evidence [3,14,19,16]. Let $\Omega$ be a finite set, and let
$\mathcal{P}(\Omega)$ be the set of all subsets of $\Omega$. The fundamental concept for representing
uncertainty is that of belief structure, also called basic probability assignment,
defined as a function $m$ from $\mathcal{P}(\Omega)$ to $[0, 1]$ verifying
$\sum_{A \subseteq \Omega} m(A) = 1$. A belief structure $m$
such that $m(\emptyset) = 0$ is said to be normal. Any subset $A$ of $\Omega$ such that $m(A) > 0$ is
called a focal element of $m$. We will denote by $F(m)$ the set of focal elements
of $m$. The information provided by a belief structure can be represented by a
credibility function or a plausibility function defined, respectively, as:
$\mathrm{bel}(A) = \sum_{\emptyset \neq B \subseteq A} m(B)$ and
$\mathrm{pl}(A) = \sum_{B \cap A \neq \emptyset} m(B) = \mathrm{bel}(\Omega) - \mathrm{bel}(\bar{A})$. The quantity
$\mathrm{bel}(A)$ is interpreted as the total belief committed to $A$ and $\mathrm{pl}(A)$ as the belief
that might be committed to $A$, if further information became available. They
can be regarded, respectively, as lower and upper bounds of a set of compatible
probabilities. One of these, the pignistic probability, is particularly useful for
decision making [16]; it is defined as:

$$\mathrm{BetP}(\omega) = \sum_{A \subseteq \Omega} \frac{m(A)}{|A|}\, \delta_A(\omega)$$

where $\delta$ is the Kronecker symbol ($\delta_A(\omega) = 1$ if $\omega \in A$ and $0$ otherwise) and $|A|$
denotes the cardinality of $A$. Evidence
theory can easily be generalized to continuous spaces provided the number of
focal elements $|F(m)|$ remains finite [8] (further generalizations are possible, but
they are outside the scope of this paper).
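
For a normal belief structure on a finite set, these three quantities are direct sums over the focal elements. The sketch below is a minimal illustration: the encoding of a belief structure as a dictionary from frozensets to masses, the function names and the example masses are all assumptions made here.

```python
def bel(m, a):
    """Total mass of the non-empty focal elements included in a."""
    return sum(w for b, w in m.items() if b and b <= a)

def pl(m, a):
    """Total mass of the focal elements that intersect a."""
    return sum(w for b, w in m.items() if b & a)

def betp(m, omega):
    """Pignistic probability of the single element omega (m assumed normal)."""
    return sum(w / len(a) for a, w in m.items() if omega in a)

# Hypothetical normal belief structure on the frame {a, b, c}.
frame = frozenset("abc")
m = {frozenset("a"): 0.5, frozenset("ab"): 0.3, frame: 0.2}

print(bel(m, frozenset("ab")), pl(m, frozenset("b")))
print({w: round(betp(m, w), 4) for w in sorted(frame)})
```

Each focal element shares its mass equally among its members, so the pignistic values sum to one over the frame.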

One of the most important operations in the theory is the procedure for
aggregating multiple belief structures on the same variable. For any binary set
operation $\nabla$, the fusion of two belief structures $m_1$ and $m_2$, denoted
$m = m_1 \nabla m_2$, may be defined as:
$m(A) = \sum_{B \nabla C = A} m_1(B)\, m_2(C)$, for all $A \subseteq \Omega$. This
operation may produce a subnormal belief structure, i.e. one may have $m(\emptyset) > 0$,
particularly in the case of the conjunctive sum defined originally by Dempster,
where $\nabla = \cap$. The Dempster normalization procedure converts a subnormal
belief structure $m$ into a normal one $m^*$ defined as follows:
$m^*(A) = \dfrac{m(A)}{1 - m(\emptyset)}$ for $A \neq \emptyset$ and $m^*(\emptyset) = 0$.
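
The conjunctive sum followed by Dempster normalization can be sketched in a few lines for belief structures on a finite set. The dictionary encoding, the function names and the example masses are assumptions made for illustration.

```python
from itertools import product

def conjunctive_sum(m1, m2):
    """Fusion with the set operation taken to be intersection."""
    fused = {}
    for (b, w1), (c, w2) in product(m1.items(), m2.items()):
        a = b & c
        fused[a] = fused.get(a, 0.0) + w1 * w2
    return fused

def dempster_normalize(m):
    """Remove the mass on the empty set and rescale the remaining masses."""
    conflict = m.get(frozenset(), 0.0)
    if conflict >= 1.0:
        raise ValueError("total conflict: normalization is undefined")
    return {a: w / (1.0 - conflict) for a, w in m.items() if a}

# Hypothetical belief structures on the frame {a, b}.
frame = frozenset("ab")
m1 = {frozenset("a"): 0.6, frame: 0.4}
m2 = {frozenset("b"): 0.5, frame: 0.5}

subnormal = conjunctive_sum(m1, m2)
print(subnormal[frozenset()])          # mass assigned to the empty set
print(dempster_normalize(subnormal))
```

Here the two sources partly contradict each other, so the unnormalized result places mass 0.3 on the empty set before rescaling.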

Fuzzy extensions of evidence theory have been proposed by different authors
[15,21]. The basic idea is to allow the focal elements of a belief structure to be
fuzzy sets. If $A$ is a fuzzy subset of $\Omega$, we will denote by $A(\cdot)$ the membership
function of $A$ and by $h(A)$ its height. If $\mathcal{F}(\Omega)$ denotes the set of fuzzy sets of $\Omega$,
a fuzzy belief structure (FBS) is a function $m$ from $\mathcal{F}(\Omega)$ to $[0, 1]$ such that:

$$m(A) = 0 \ \text{ if } A \notin F(m), \qquad \sum_{A \in F(m)} m(A) = 1$$

for some finite subset $F(m)$ of $\mathcal{F}(\Omega)$. Here again, $A$ is a focal element of $m$ if
$m(A) \neq 0$. If at least one of the focal elements is not normal, $m$ is called a
subnormal fuzzy belief structure. Yager [20] proposed a “smooth normalization
procedure” (SNP) for converting a subnormal fuzzy belief structure into a
normal one. This method generalizes both fuzzy set normalization and Dempster’s
normalization of crisp belief structures. Assume $m$ to be a subnormal fuzzy belief
structure with focal elements $F_1, \ldots, F_n$. The SNP converts $m$ to a normal FBS $m^*$
with focal elements $F_1^*, \ldots, F_n^*$ such that:

$$F_i^*(\omega) = \frac{F_i(\omega)}{h(F_i)} \quad \forall \omega \in \Omega, \qquad
m^*(F_i^*) = \frac{m(F_i)\, h(F_i)}{\sum_{j=1}^{n} m(F_j)\, h(F_j)} \qquad (2)$$
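
The smooth normalization procedure can be sketched for fuzzy sets over a small finite domain. This is an illustration under assumptions: fuzzy sets are encoded as dictionaries of membership degrees, focal elements of height zero are dropped (mirroring the removal of the empty set in Dempster's normalization), and the names and numbers are hypothetical.

```python
def height(fuzzy_set):
    """Largest membership degree of a fuzzy set given as {element: degree}."""
    return max(fuzzy_set.values())

def smooth_normalize(fbs):
    """Apply the smooth normalization procedure to a list of (fuzzy set, mass)."""
    total = sum(mass * height(f) for f, mass in fbs)
    if total == 0:
        raise ValueError("all focal elements are empty: nothing to normalize")
    normalized = []
    for f, mass in fbs:
        h = height(f)
        if h == 0:
            continue  # an empty focal element carries no mass after normalization
        scaled = {w: degree / h for w, degree in f.items()}
        normalized.append((scaled, mass * h / total))
    return normalized

# Hypothetical subnormal fuzzy belief structure on the domain {u, v}.
fbs = [({"u": 0.5, "v": 0.25}, 0.4), ({"u": 1.0, "v": 0.2}, 0.6)]

for fuzzy_set, mass in smooth_normalize(fbs):
    print(fuzzy_set, round(mass, 4))
```

The first focal element has height 0.5, so its membership degrees are doubled while its share of the total mass shrinks from 0.4 to 0.25.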



The combination of fuzzy belief structures can also be easily generalized.
Whatever combination operator $\nabla$ is used, its fuzzy version is employed. Thus, if $m_1$
and $m_2$ are two fuzzy belief structures with focal elements $A_i$ and $B_j$
respectively, and $m = m_1 \cup m_2$, the focal elements of $m$ are defined as: $C_k = A_i \cup B_j$
where $C_k(\omega) = S(A_i(\omega), B_j(\omega))$ and $S$ is a t-conorm. The corresponding weight
is $m(C_k) = \sum_{A_i \cup B_j = C_k} m_1(A_i)\, m_2(B_j)$. The concept of pignistic probability can
also easily be extended as: $\mathrm{BetP}(\omega) = \sum_{A \in F(m)} m(A)\, \dfrac{A(\omega)}{|A|}$.



3 Application to Regression Analysis 

3.1 The Proposed Model 

In this paper, we propose to apply the tools of fuzzy evidence theory to regression
analysis. This approach extends the method formerly proposed by Denoeux [4,
5] in the context of pattern classification.

In the following, we will denote by $y$ a fuzzy quantity, belonging to $\mathcal{F}(\mathcal{Y})$,
where $\mathcal{Y}$ is a given reference set. Let $(x_i, y_i)_{i=1,\ldots,n}$ be a training set of
input-output data vectors belonging to $(\mathcal{X}_1 \times \ldots \times \mathcal{X}_r) \times \mathcal{F}(\mathcal{Y})$, where each $\mathcal{X}_j$ and $\mathcal{Y}$
are supposed to be continuous real spaces. The output data can thus be either
real numbers, intervals, fuzzy numbers or any fuzzy quantities. Let us denote by $\mathcal{X}$
the Cartesian space $\mathcal{X}_1 \times \ldots \times \mathcal{X}_r$.

Let $x$ denote an arbitrary input vector, and let $y$ be the corresponding
unknown output value. Each pair $(x_i, y_i)$ may be regarded as a piece of evidence
inducing some beliefs concerning $y$. If one learns that $x$ is “close” to $x_i$ according
to a relevant metric $d$, we can expect the corresponding output $y$ to be similar to
$y_i$. This may be modeled by assigning to $y_i$ a certain fraction of a unit amount of
belief, depending on the distance between $x$ and $x_i$. Since no other hypothesis is
made, the remaining mass of belief should be allocated to the whole set $\mathcal{Y}$. We
thus obtain a FBS $m_i$ on $\mathcal{Y}$, with focal elements $F(m_i) = \{y_i, \mathcal{Y}\}$, defined as:

$$\begin{cases}
m_i(y_i) = \phi[d(x, x_i)] \\
m_i(\mathcal{Y}) = 1 - \phi[d(x, x_i)] \\
m_i(A) = 0 \quad \forall A \in \mathcal{F}(\mathcal{Y}) \setminus F(m_i),
\end{cases} \qquad (3)$$

where $\phi$ is a decreasing function such that $\phi(0) = 1$ and $\lim_{d \to \infty} \phi(d) = 0$. In the
following, we will denote $\phi_i(x) = \phi[d(x, x_i)]$. For example, if $\phi$ is the exponential
function, and $d$ is a Euclidean distance induced by some symmetric positive
definite matrix $\Gamma$: $\phi_i(x) = \exp[-(x - x_i)^T \Gamma (x - x_i)]$.

We note that, by this means, several kinds of uncertainties are clearly
identified: fuzziness, imprecision or nonspecificity (related to the cardinality of $y_i$)
and ignorance, represented by the weight assigned to the reference output set $\mathcal{Y}$.

In order to combine the information provided by each piece of evidence, we
use the generalization of the conjunctive rule of combination to FBS. Then, the
final FBS is: $m = \bigcap_{i=1}^{n} m_i$. To decrease the computational load, the combination
can alternatively be restricted to the $k$ nearest neighbors of $x$ in the training
set: $\{x_{(1)}, \ldots, x_{(k)}\}$. The final FBS then becomes: $m = \bigcap_{i=1}^{k} m_{(i)}$.

The pignistic probability function induced by $m$ may then be defined as

$$\mathrm{BetP}(y) = \sum_{A \in F(m)} m(A)\, \frac{A(y)}{|A|}$$

where $|A| = \int_{\mathcal{Y}} A(y)\, dy$, since $\mathcal{Y}$ is a continuous space. A point estimate of $y$ can
be obtained by taking the expected value of the output:

$$E(Y) = \int_{\mathcal{Y}} y\, \mathrm{BetP}(y)\, dy$$

Let us denote by $y_A^*$ the center of area of the set $A$,
$y_A^* = \dfrac{\int_{\mathcal{Y}} y\, A(y)\, dy}{|A|}$. Then,
$E(Y) = \sum_{A \in F(m)} m(A)\, y_A^*$.

In the following, the model defined by Eq. (3) will be referred to as FBS1.
As remarked by Smets (personal communication), if the $y_i$ are crisp numbers,
then $m_i$ and $m$ become standard belief structures with only $n + 1$ crisp focal
elements and the final belief structure can be easily expressed as (FBS2 model):

$$\begin{cases}
m(\{y_i\}) = \dfrac{\phi_i(x)}{K} \prod_{j \neq i} [1 - \phi_j(x)] \quad \forall i \in \{1, \ldots, n\} \\
m(\mathcal{Y}) = \dfrac{1}{K} \prod_{i=1}^{n} [1 - \phi_i(x)] \\
m(A) = 0 \quad \forall A \in \mathcal{F}(\mathcal{Y}) \setminus F(m),
\end{cases} \qquad (4)$$

where $K = \prod_{i=1}^{n} [1 - \phi_i(x)] + \sum_{i=1}^{n} \phi_i(x) \prod_{j \neq i} [1 - \phi_j(x)]$ is the normalization
factor, and $F(m) = \{\{y_1\}, \ldots, \{y_n\}, \mathcal{Y}\}$.
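
The closed form of Eq. (4) is straightforward to evaluate. The following sketch makes several simplifying assumptions that are not in the text: inputs are scalars, the matrix in the exponential is replaced by a scalar `gamma`, the crisp outputs are taken to be distinct, and the function names and sample points are hypothetical.

```python
import math

def phi(x, xi, gamma=1.0):
    """Exponential decreasing function of the squared distance (scalar inputs)."""
    return math.exp(-gamma * (x - xi) ** 2)

def fbs2_masses(x, xs, gamma=1.0):
    """Return (masses on each singleton {y_i}, mass on the whole output set).

    Undefined (division by zero) if x coincides with two training inputs,
    since the normalization factor then vanishes.
    """
    support = [phi(x, xi, gamma) for xi in xs]
    doubt = [1.0 - s for s in support]
    singles = [
        s * math.prod(d for j, d in enumerate(doubt) if j != i)
        for i, s in enumerate(support)
    ]
    ignorance = math.prod(doubt)
    k = ignorance + sum(singles)  # normalization factor K
    return [s / k for s in singles], ignorance / k

# Hypothetical training inputs and query point.
train_x = [0.0, 1.0, 3.0]
singles, ignorance = fbs2_masses(0.5, train_x)
print([round(s, 4) for s in singles], round(ignorance, 4))
```

The query point 0.5 is equidistant from the first two training inputs, so their singletons receive equal mass, while the distant third point contributes very little.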

3.2 Generalization to Conflicting Outputs

We propose to generalize Eq. (3) to the case where there is some conflict
between different possible outputs for a given input $x_i$. We consider that the
knowledge of output $y_i$ can be modeled as a FBS $m_{y_i}$, such that
$F(m_{y_i}) = \{y_{i1}, \ldots, y_{ij}, \ldots, y_{i,J(i)}\}$, where each $y_{ij}$ is a fuzzy quantity, and
$m_{y_i}(y_{ij}) = p_{ij}$. We will call such a FBS $m_{y_i}$ a fuzzy belief structure number
(FBS number).

If we now consider a new vector $x$, $(x_i, m_{y_i})$ defines a piece of evidence
concerning the output $y$ of $x$, which can be modeled by a FBS $m_i$ on $\mathcal{Y}$, whose
focal elements are $F(m_i) = F(m_{y_i}) \cup \{\mathcal{Y}\}$. The belief structure $m_i$ is defined as:

$$\begin{cases}
m_i(y_{ij}) = p_{ij}\, \phi_i(x) \quad \forall j \in \{1, \ldots, J(i)\} \\
m_i(\mathcal{Y}) = 1 - \phi_i(x) \\
m_i(A) = 0 \quad \forall A \in \mathcal{F}(\mathcal{Y}) \setminus F(m_i).
\end{cases} \qquad (5)$$

Here again, the number of elements in $F(m)$, which can reach in the worst case
$\prod_{i=1}^{n} J(i)$, can be reduced by considering the $k$ nearest neighbors of $x$ in the
training set. This model will be referred to as FBS3.

3.3 Approach based on Prototypes

For the sake of simplicity, we will assume in this section the $y_i$ to be real numbers. If the training set is large, it can be summarized by a smaller number of reference vectors or prototypes. Let $C = \{c_i \in \mathcal{X}, i \in \{1, \ldots, I\}\}$ and $E = \{e_j \in \mathcal{Y}, j \in \{1, \ldots, J\}\}$ be two sets of $I$ input and $J$ output prototypes, respectively. These prototypes may have been created by means of any clustering method such as the c-means or fuzzy c-means algorithms, separately or not. If we associate every $x \in \mathcal{X}$ to its nearest prototype in $C$ according to a given distance, the set $\mathcal{X}$ can be partitioned in $I$ classes $(C_i)_{i=1}^{I}$, characterized by cluster centers $c_i$ and scatter matrices $S_i = \sum_{k=1}^{N} u_{ik} (x_k - c_i)(x_k - c_i)^T$, where $u_{ik}$ denotes the membership degree of $x_k$ to class $C_i$. In the same way, $\mathcal{Y}$ is partitioned in $J$ classes $(E_j)_{j=1}^{J}$. Let us assume that the $(x_k, y_k)$ are a sample coming from $N$ independent random vectors $(X_k, Y_k)$ on a probability space $((\mathcal{X}, \mathcal{Y}), \mathcal{A}, P_{XY})$, and let us denote by $p_{ij}$ the probability $P_{Y|X}(Y \in E_j | X \in C_i)$. The output prototypes $e_j$ can be “fuzzified” as $\tilde{e}_j \in \mathcal{F}(\mathcal{Y})$ by taking the estimated standard deviation $\sigma_j$ of $E_j$ into account. For example, since the variables are supposed to be real and continuous, the $\tilde{e}_j$ can be assumed to be Gaussian fuzzy numbers, denoted $(e_j, \sigma_j)$, whose membership function $\tilde{e}_j(u)$, $u \in \mathbb{R}$, is Gaussian-shaped, centered at $e_j$ with spread $\sigma_j$.

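For vector $x$, each $c_i$ provides a piece of evidence concerning the output $y$, modeled by a FBS $m_i$ similar to Eqs. (3-5):

\[
\begin{cases}
m_i(\tilde{e}_j) = p_{ij}\, \phi[d(x, c_i)] \quad \forall j \in \{1, \ldots, J\} \\
m_i(\mathcal{Y}) = 1 - \phi[d(x, c_i)] \\
m_i(A) = 0 \quad \forall A \in \mathcal{F}(\mathcal{Y}) \setminus \{\tilde{e}_1, \ldots, \tilde{e}_J, \mathcal{Y}\}.
\end{cases}
\tag{6}
\]

Here, $|F(m)| \leq 2^J$. The number of focal elements is generally much smaller than in FBS1. This model is referred to as FBS4.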

Note that, if we “fuzzify” each component $c_{ij}$ of $c_i$ by Gaussian fuzzy numbers $\tilde{c}_{ij} = (c_{ij}, \sigma_{ij}^2)$ on $\mathcal{X}_j$, and if we denote by $\tilde{c}_i$ the Cartesian product of the $\tilde{c}_{ij}$, we have the following equality for every $x \in \mathcal{X}$: $\tilde{c}_i(x) = \phi[d(x, c_i)]$ if $\phi$ is the exponential function.

Let us now estimate the probability matrix $P = (p_{ij})$. If $\pi_{ij} = P_{XY}((X, Y) \in C_i \times E_j)$ and $N_{ij}$ is the random variable counting the pairs $(x_k, y_k)$ belonging to $C_i \times E_j$, then by definition $N_{ij} \sim \mathcal{B}(N, \pi_{ij})$. The Strong Law of Large Numbers theoretically justifies the following estimation:

\[ p_{ij} = P_{Y|X}(E_j | C_i) = \frac{P_{XY}(E_j \cap C_i)}{P_X(C_i)} \approx \frac{N_{ij}}{\sum_{j'=1}^{J} N_{ij'}} = \hat{p}_{ij}. \]

In particular, $\hat{p}_{ij} \to P(Y \in E_j | X \in C_i)$ almost surely.
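The counting estimate can be sketched as follows, assuming that each training pair is assigned to its nearest input and output prototype in the Euclidean sense. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def nearest_prototype(points, prototypes):
    """Index of the nearest prototype (Euclidean distance) for each point."""
    points = np.asarray(points, float).reshape(len(points), -1)
    prototypes = np.asarray(prototypes, float).reshape(len(prototypes), -1)
    dists = np.linalg.norm(points[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)

def estimate_conditional_probabilities(x, y, input_protos, output_protos):
    """p_hat[i, j] = N_ij / sum_j' N_ij', where N_ij counts the training
    pairs falling in C_i x E_j. Rows without any point are left at zero."""
    rows = nearest_prototype(x, input_protos)
    cols = nearest_prototype(y, output_protos)
    counts = np.zeros((len(input_protos), len(output_protos)))
    np.add.at(counts, (rows, cols), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

p_hat = estimate_conditional_probabilities(
    x=[0.1, 0.2, 0.9, 1.1], y=[0.0, 1.0, 1.0, 1.1],
    input_protos=[0.0, 1.0], output_protos=[0.0, 1.0],
)
```

In this toy example the first input class sees both output classes once, so its row is (0.5, 0.5), while the second input class only sees the second output class, giving (0, 1). Rows with many zeros are what keeps the number of focal elements small in practice.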

Another problem that remains to be solved in this approach is the determination of the optimal number of prototypes. The problem is essentially the same as that of fuzzy system identification, for which numerous methods for finding the optimal number of clusters have been suggested in the literature [10]. Various criteria have been proposed, based on empirical fuzzy covariance matrices

\[ \Sigma_i = \frac{\sum_{k=1}^{N} u_{ik} (x_k - c_i)(x_k - c_i)^T}{\sum_{k=1}^{N} u_{ik}} \]

or within-classes scatter measures:

\[ SSW = \sum_{i=1}^{I} \sum_{k=1}^{N} u_{ik}\, d^2(x_k, c_i). \]

In particular, a simple but efficient approach is to consider the classical “explained inertia ratio” SSW/SST, where $SST = \sum_{k=1}^{N} d^2(x_k, \bar{x})$ and $\bar{x}$ is the global mean vector. Since this ratio is a decreasing function of $I$, a suitable value $I^*$ may be chosen by imposing an acceptable minimal threshold.

Once an input prototype $c_i$ has been determined, we propose to induce the corresponding output prototype as the Nadaraya-Watson Gaussian kernel regressor evaluated at $c_i$:

\[ e_i = \frac{\sum_{k=1}^{N} y_k \exp[-(c_i - x_k)^T \Lambda (c_i - x_k)]}{\sum_{k=1}^{N} \exp[-(c_i - x_k)^T \Lambda (c_i - x_k)]}, \]

where $\Lambda$ is a symmetric positive definite matrix to be estimated. For the sake of simplicity, $\Lambda$ can be assumed to be diagonal: $\Lambda = \mathrm{diag}(\lambda_j)$. Optimal values of the $\lambda_j$ can be obtained by minimizing the classical cross-validation criterion for a given test set.
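A minimal sketch of the Nadaraya-Watson regressor with a diagonal matrix, assuming that the diagonal entries are passed as a vector (or a single scalar). The function name and the toy data are assumptions made for illustration, and the cross-validation step is not shown.

```python
import numpy as np

def nadaraya_watson(c, x_train, y_train, lam):
    """Gaussian-kernel Nadaraya-Watson estimate at point c, with a
    diagonal matrix Lambda = diag(lam). Assumes at least one training
    point is close enough to c for the weights not to underflow to zero."""
    y_train = np.asarray(y_train, float)
    x_train = np.asarray(x_train, float).reshape(len(y_train), -1)
    diff = x_train - np.atleast_1d(np.asarray(c, float))
    weights = np.exp(-np.sum(np.asarray(lam, float) * diff**2, axis=1))
    return float(weights @ y_train / weights.sum())

smooth = nadaraya_watson(0.0, [-1.0, 0.0, 1.0], [1.0, 2.0, 3.0], lam=1.0)
local = nadaraya_watson(1.0, [-1.0, 0.0, 1.0], [1.0, 2.0, 3.0], lam=50.0)
```

Large values of the diagonal entries make the estimate follow the nearest training output, while small values average over the whole training set; this is the trade-off that cross-validation is meant to settle.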

4 Simplification Procedure 

A well-known drawback of evidence theory based methods is that they are generally time consuming, due to the exponential increase of focal elements encountered when combining large numbers of belief structures. We recall that the number of focal elements can reach $|F(m)| = 2^J$ with FBS4. This implies different kinds of problems. Practically, the computation of the outputs becomes too time-consuming. Moreover, even if the results can be computed, some of the individual focal elements of the final FBS may not have a clear interpretation. Some of them could represent approximately the same output, and the resulting information would be so complex that it would become difficult to exploit.

So, our problem is to control the complexity of the output structure. For that purpose, we propose to define a simplification procedure for any (crisp or fuzzy) belief structure. Some authors have already proposed various methods in the case of classical belief structures [1]. Here we propose two ways of simplifying a FBS $m$: elimination of the least representative focal elements and aggregation of similar focal elements. During combination, for a chosen threshold $w$, if $m(F_i) < w$, $F_i$ is eliminated and the FBS $m$ is renormalized accordingly. In the aggregation phase, two steps can be defined: choosing which focal elements have to be replaced, and replacing two focal elements by a new one. The first question requires the definition of a measure of similarity between fuzzy sets. The second one is an information fusion problem. In the following, we will focus on the aggregation phase, for which a hierarchical clustering approach is proposed.

Let $m$ be the initial FBS with focal elements $\{F_1, \ldots, F_n\}$. A new FBS $m'$ is obtained by replacing the two nearest focal elements $F_i$ and $F_j$ of $m$ by a new one $F'$ and keeping the other focal elements. Then $F(m') = (F(m) \cup \{F'\}) \setminus \{F_i, F_j\}$.

Among the similarity measures of fuzzy sets that have been defined in the literature [13, 22], we use the following set-based measure: $S(A, B) = \frac{|A \cap B|}{|A \cup B|}$. Then $F_i$ and $F_j$ are chosen such that $S(F_i, F_j) = \max_{k,l} S(F_k, F_l)$. Different means of aggregating the similar focal elements can be used, depending on the binary operator $\nabla$: $F' = F_i \nabla F_j$ with $m'(F') = m(F_i) + m(F_j)$. We can separate these operators into three classes: disjunctive, neutral and conjunctive. To keep as much information as possible, it seems preferable to take the union of the two sets, $F' = F_i \cup F_j$, as the new focal element. Moreover, this choice leads to a cautious approximation, since $m'$ is less informative than $m$ in terms of the nonspecificity $N$, defined as $N(m) = \sum_{A \in F(m)} m(A) \log_2 |A|$. It is easy to verify that $N(m') \geq N(m)$, which proves that $m'$ is less specific than $m$. However, another possibility is to take a more neutral operator by considering the weighted average of the sets, so that:

\[ F'(\omega) = \frac{m(F_i)\, F_i(\omega) + m(F_j)\, F_j(\omega)}{m(F_i) + m(F_j)}. \]

We then have an interesting analogy with Ward’s method in hierarchical clustering [18].

This iterative aggregation procedure stops as soon as $\max_{k,l} S(F_k, F_l) < \delta$, where $\delta$ is a given threshold.
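The aggregation loop can be sketched as follows. This is only an illustration under stated assumptions: focal elements are non-empty fuzzy sets discretized on a common grid, intersection and union are taken as pointwise minimum and maximum, and the similarity is the ratio of the cardinality of the intersection to that of the union. The function names and the example sets are hypothetical.

```python
import numpy as np

def cardinality(membership, grid):
    """|A| as the integral of the membership function (trapezoidal rule)."""
    return float(np.sum(0.5 * (membership[1:] + membership[:-1]) * np.diff(grid)))

def similarity(a, b, grid):
    """S(A, B) = |A n B| / |A u B| with min/max as intersection/union."""
    return cardinality(np.minimum(a, b), grid) / cardinality(np.maximum(a, b), grid)

def simplify(focal, grid, delta, mode="union"):
    """Merge the two most similar focal elements until the largest
    similarity falls below delta. focal: list of (mass, membership)."""
    focal = list(focal)
    while len(focal) > 1:
        best, pair = -1.0, None
        for i in range(len(focal)):
            for j in range(i + 1, len(focal)):
                s = similarity(focal[i][1], focal[j][1], grid)
                if s > best:
                    best, pair = s, (i, j)
        if best < delta:
            break
        (mi, fi), (mj, fj) = focal[pair[0]], focal[pair[1]]
        if mode == "union":
            merged = np.maximum(fi, fj)                 # cautious choice
        else:
            merged = (mi * fi + mj * fj) / (mi + mj)    # weighted average
        focal = [f for k, f in enumerate(focal) if k not in pair]
        focal.append((mi + mj, merged))
    return focal

grid = np.linspace(0.0, 10.0, 2001)
tri = lambda center: np.clip(1.0 - np.abs(grid - center), 0.0, None)
focal = [(0.5, tri(3.0)), (0.3, tri(3.2)), (0.2, tri(8.0))]
reduced = simplify(focal, grid, delta=0.5)
```

Here the two overlapping triangles have a similarity of about 0.68 and are merged, while the distant one is kept, so three focal elements become two. With a threshold of 0 everything collapses into a single fuzzy set, and with a threshold of 1 only identical sets are merged.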

5 Results 

In order to demonstrate the validity and predictive capabilities of our method, we present the results of several simulations. In the case of crisp data, prediction accuracy was measured by the mean squared error (MSE) $E = \frac{1}{N} \sum_{k=1}^{N} (\hat{y}_k - y_k)^2$. To describe the information content of the outputs, two classical evidential uncertainty measures [12] were used: nonspecificity, which represents set imprecision, and strife:

\[ S(m) = -\sum_{A \in F(m)} m(A) \log_2 \sum_{B \in F(m)} m(B) \frac{|A \cap B|}{|A|}. \]
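As an illustration of the two measures, the sketch below evaluates them for a belief structure with crisp, finite focal elements. This is a simplifying assumption: in the fuzzy case the set cardinalities would be replaced by integrals of membership functions. The function names and example structures are hypothetical.

```python
import math

def nonspecificity(m):
    """N(m) = sum_A m(A) log2 |A| over non-empty focal elements."""
    return sum(mass * math.log2(len(a)) for a, mass in m.items() if a)

def strife(m):
    """S(m) = -sum_A m(A) log2 sum_B m(B) |A n B| / |A|."""
    total = 0.0
    for a, mass_a in m.items():
        inner = sum(mass_b * len(a & b) / len(a) for b, mass_b in m.items())
        total -= mass_a * math.log2(inner)
    return total

conflicting = {frozenset({1, 2}): 0.5, frozenset({3, 4}): 0.5}
vacuous = {frozenset({1, 2, 3, 4}): 1.0}
```

The two disjoint focal elements yield one bit of nonspecificity and one bit of strife, whereas the vacuous structure is maximally nonspecific (two bits) but carries no conflict.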

Example 1: In this regression problem [6], $x \in \mathbb{R}$ is taken from a mixture of two Gaussian distributions, $f(x) \sim 0.5\,\mathcal{N}(-1.5, 0.5) + 0.5\,\mathcal{N}(1.75, 0.5)$, and $y = \sin(3x) + x + \varepsilon(x)$, where $\varepsilon(x) \sim \mathcal{N}(0, 0.01)$ if $x < 0$ and $\varepsilon(x) \sim \mathcal{N}(0, 1)$ if $x > 0$. Here, $N = 200$. We used the FBS4 version with I = J = 2b prototypes and took the $k = 3$ nearest neighbors for combination. Here, the output prototypes $e_i$ are calculated as output values of the $c_i$ by means of a Nadaraya-Watson regression method using the training set. The interquartile interval from the calculated pignistic distribution BetP is shown in Figure 1. These results are compared to a model based on fuzzy systems proposed by Yager in [20]. The regions of low density are well reflected by a large value of nonspecificity and ignorance $m(\mathcal{Y})$ and a very large interquartile interval. In contrast, Yager’s model does not reflect this uncertainty well, since the nonspecificity does not increase in those regions. As expected, and contrary to the results in [6], the regions of large variance are characterized by a larger confidence interval and a greater nonspecificity than regions of small variance.

In order to check that our method was efficient in terms of accuracy, we compared it with several well-known regression methods (cf. Figure 2). Table 1 shows the mean squared errors for two classical nonparametric methods, the Smoothing Cubic Splines (CS) and the Nadaraya-Watson kernel regressor (NW), a memory-based method called the “lazy” method [2], Yager’s model, and two versions of our method: FBS4 and the simplified one (FBS2). The MSE have been calculated only in the region of high density for two cases: the first one corresponds to the region of well specified data, with small variance ($x \in [-2.5, -0.7]$). The other one also includes the region with high variance ($x \in [-2.5, -0.7] \cup [0.8, 2.6]$). As shown in Table 1, the performances of our method (FBS2 or FBS4) are roughly equivalent to those of standard methods when applied to this classical problem. One can notice that the spline method seems to be better adapted to the case of homogeneous data in this example. However, this global method cannot cope with heterogeneous data, as our method seems to do.

Fig. 1. Example 1. Left: real data (-); estimates (-); interquartile interval of BetP(y) (- -); training set (+). Right: nonspecificity with FBS4 (-) and with Yager’s method (- -), and mass $m(\mathcal{Y})$ assigned to the whole space by FBS4.



[Figure 2 panels: FBS2 estimate; “Lazy” estimate; Nadaraya-Watson estimate; smoothing cubic splines estimate]

Fig. 2. Example 1: training set (.); real data (-); point estimation for various models (-)



Regression method   CS        NW      “Lazy”   Yager   FBS2    FBS4
MSE 2               0.035     0.016   0.019    0.010   0.013   0.011
MSE 1               10^{-?}   0.003   0.013    0.008   0.011   0.010



Table 1. Mean squared error for various models: MSE 1 for $x \in [-2.5, -0.7]$, MSE 2 for $x \in [-2.5, -0.7] \cup [0.8, 2.6]$.

Example 2: We consider a set of input-output data generated from the Monod equation $y = \frac{a x}{b + x} + \varepsilon(x)$, $x \in [0, 2]$, where $\varepsilon(x) \sim \mathcal{N}(0, 0.001)$ if $x < 1$ and $\varepsilon(x) \sim \mathcal{N}(0, 0.01)$ if $x > 1$. If $x \in [0.5, 1.5[$, $(a, b) = (.35, .5)$ or $(.45, .45)$ with the same probability $p_{i1} = p_{i2} = .5$, and $(a, b) = (.35, .5)$ if $x \in [0, 2] \setminus [0.5, 1.5[$ (see Figure 3). We used the FBS4 version, with $I = 20$, $J = 30$ and $k = 3$ neighbors, and the prototypes were generated as in Example 1, using the two target functions, depending on the parameter values $a$ and $b$. We used the “weighted average” version of our simplification method with different aggregation thresholds $\delta$ during combination. When there is no simplification, $\delta = 1$. We note that even in this extreme case, the number of focal elements $|F(m)|$ is much less than $2^J$, since most estimated probabilities $\hat{p}_{ij}$ are equal to 0. In the other extreme case, $\delta = 0$ and all the focal elements are aggregated (i.e. $|F(m)| = 1$), reducing the FBS to a simple fuzzy set. Figure 4 shows that $|F(m)|$ increases in the regions of output discord and quickly decreases with $\delta$. In Figures 3 and 4, we also show some results for the special case where $\delta = 0.5$. The four regions can be clearly characterized by the uncertainty measures. The regions (1-2) ($x \in [0, 1]$) of larger variance are characterized by a larger nonspecificity, and the conflict in the output value is well represented by a larger strife in regions (2-3) ($x \in [0.5, 1.5]$). The pignistic probability calculated for four typical values ($x = 0.2$; $0.6$; $1.3$; $1.6$) summarizes well the four situations encountered.



Fig. 3. Example 2. Left: training set (+); point estimation (-), prototypes (o) and classes (:). Right: nonspecificity and strife with FBS4 (-).



Fig. 4. Example 2. Left: pignistic distribution for 4 different inputs: x = 0.2, 0.6, 1.3, 1.6 (-); real value (-); point estimation (-). Right: number of focal elements for different values of $\delta$: 1 (-), 0.5 (-), 0.2 (-.-) and 0 (+).

6 Conclusion

In this paper, a new approach to regression based on fuzzy evidence theory has been proposed. This approach makes it possible to take into account different kinds of output data, such as intervals, fuzzy numbers, or more generally fuzzy belief structures. In terms of accuracy of point estimation, our method can also cope with classical regression problems as well as statistical methods. Moreover, since it can be viewed as a local functional approximation method, estimations can even be more accurate than those provided by global models. Thus, this method can be seen as a generalization of regression analysis for nonstandard data. Two kinds of output information are offered: a representation of uncertainty that can take several forms (imprecision, indetermination, ignorance) and point estimation accuracy. A well-known drawback of evidence theory based methods is that they are generally time consuming, due to the exponential increase of focal elements encountered when combining large numbers of belief structures. This problem can be minimized by clustering training data, or by considering only neighboring information in the learning set. In our method, we propose to approximate the belief structures by aggregating similar focal elements. This simplification method can be used for two different objectives: during combination, computing time is decreased, and after combination, the interpretability of the results is enhanced.

Abstract. A method is proposed for determining the state of a dynamical system modeled by a Petri net, using observations of its inputs. The initial state of the system may be totally or partially unknown, and sensor reports may be uncertain. In previous work, a belief Petri net model using the formalism of evidence theory was defined, and the resolution of the system was done heuristically by adapting the classical evolution equations of Petri nets. In this paper, a more principled approach based on the Transferable Belief Model is adopted, leading to simpler computations. An example taken from an intelligent vehicle application illustrates the method throughout the paper.



1 Preliminaries 

1.1 Introduction 

The problem addressed in this paper concerns the determination of the state of a dynamical system, using sensor reports. The sequential evolution of the system is modeled by a simple class of Petri nets [1-3] composed of a set $P = \{p_1, \ldots, p_n\}$ of places, a set $T = \{t_1, \ldots, t_q\}$ of transitions or logical propositions, and a set $F \subseteq (P \times T) \cup (T \times P)$ of arcs connecting a place to a transition, or a transition to a place (each transition is assumed to have only one output place, and each place has at most one output transition). At each time step, one of the $n$ places is marked by a token. A move of the token from place $p$ to place $p'$ occurs if there is a transition $t$ with input place $p$ and output place $p'$, and if this transition has truth value 1.

This formalism may be used to model the behavior of certain physical systems, the state $X^k$ of the system at each time step $k$ being described by the marking of the Petri net at time $k$. In the type of applications considered here, the initial state of the system and/or the truth value of the transitions are only partially known, and the goal is to determine as accurately as possible the actual system state.

Our approach is illustrated throughout this paper by a typical example taken 
from an intelligent vehicle application [4]. 



A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 352-361, 1999. 
© Springer-Verlag Berlin Heidelberg 1999 




State Recognition in Discrete Dynamical Systems Using Petri Nets 






1.2 Scenario Description 

The goal is to detect and characterize an overtaking maneuver on a highway composed of two one-way lanes, the dynamic motion of the vehicle being observed by proprioceptive sensors such as accelerometers, steering wheel angle sensors and braking sensors. An experimental vehicle (EV) travels in the right lane of a highway. It catches up with a target vehicle (TV) traveling in the same lane at a lower speed. The EV then begins to overtake the TV: it moves left to change lanes, then goes straight ahead. Once the TV has been overtaken, the EV moves back to the right lane.

The recognition of the phases of such a maneuver can be used, for instance, to run a lateral target detection sensor when the EV is in the left lane, the obtained information being useless in the other phases of the maneuver. The overtaking maneuver can also be stopped during its execution, for instance if the EV establishes that it is impossible to overtake the TV before the exit it has to take. Note that no temporal information is used, because the duration of each phase depends on the context of the maneuver (speed, length of TV, etc.) and cannot be easily evaluated.

A Petri net for this problem is shown in Figure 1. It is composed of 5 places and 4 transitions, the interpretations of which are given in Table 1. In this model, the maneuver is described as a sequence of token positions. The token is initially in place $p_1$, and ends up in place $p_5$. It is removed from a place $p$ and added to place $p'$ if there is a transition $t$ connecting these two places, which has truth value equal to 1. These truth values are deduced from numerical measurements provided by the sensors, which requires some form of numerical to symbolic conversion [5]. For instance, a report such as “lateral speed = 0.1 m/s” is transformed into a degree of confidence in the proposition “positive lateral speed and positive steering wheel angle”. Furthermore, measurement errors and the possibility of sensor faults are additional sources of uncertainty.

Note that some transitions have the same expression, for instance $t_2$ and $t_4$.



[Figure 1 diagram: p1, t1, p2, t2, p3, t3, p4, t4, p5 connected in sequence]




Fig. 1. Petri net for the overtaking maneuver. Places are drawn as circles, and transi- 
tions as boxes. The token is in place p2. 



1.3 Formalization 

Table 1. Interpretation of places and transitions in the overtaking maneuver example.

p1   Initial state
p2   Left lane change
p3   Overtaking
p4   Right lane change
p5   Final state

t1   positive lateral speed and positive steering wheel angle
t2   small lateral speed and positive or small longitudinal acceleration
t3   negative lateral speed and negative steering wheel angle
t4   small lateral speed and positive or small longitudinal acceleration

The state of a Petri net at a given time step $k$ is defined by a marking $M^k$ assigning an integer to each place. The marks take values in $\{0, 1\}$ when the net is a state machine [3]. The marking at time $k$ may then be represented by a vector $M^k = [m^k(p_1), \ldots, m^k(p_n)]^T \in \{0, 1\}^n$. For instance, we may have:

\[ M^k = [0\ 1\ 0\ 0\ 0]^T, \]

meaning that the EV is moving to the left lane. Equivalently, the state of the system at time $k$ may be described by a variable $X^k$ taking values in $P$. We then have

\[ X^k = p_i \Leftrightarrow M^k_i = 1. \]

Similarly, the truth values of the transitions at time $k$ may be described by a vector $R^k \in \{0, 1\}^q$, such as:

\[ R^k = [0\ 1\ 0\ 1]^T, \]

meaning that the lateral speed is very small, and the longitudinal speed is positive.

The evolution of the net depends on its structure and on the validity of the transitions. Let $R$ denote the vector of truth values of the transitions. If $R$ is known, the marking at time $k + 1$ is completely determined by the marking at time $k$. We can then define a transition function

\[ f : P \times \{0, 1\}^q \to P \]

such that $f(p, R) = p'$ if $R_i = 1$ for some transition $t_i$ connecting $p$ to $p'$, and $f(p, R) = p$ otherwise. For instance, with $R = [0\ 1\ 0\ 1]^T$, we have $f(p_2, R) = p_3$ and $f(p_3, R) = p_3$. The states $X^k$ and $X^{k+1}$ of the system at times $k$ and $k + 1$ are therefore related by the following equation:

\[ X^{k+1} = f(X^k, R^k). \]
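The transition function can be sketched in a few lines for the overtaking net, assuming that the net is encoded as a dictionary mapping each input place to its single output transition and output place. The names `NET` and `step` and this encoding are illustrative choices.

```python
# Overtaking net: place p_i is the input of transition t_i, whose output
# is p_{i+1} (i = 1..4). Each entry maps an input place to
# (index of its output transition in R, output place).
NET = {"p1": (0, "p2"), "p2": (1, "p3"), "p3": (2, "p4"), "p4": (3, "p5")}

def step(place, truth_values, net=NET):
    """f(p, R): move the token if the output transition of p is true,
    otherwise leave it where it is."""
    if place in net:
        index, target = net[place]
        if truth_values[index] == 1:
            return target
    return place

R = [0, 1, 0, 1]
next_from_p2 = step("p2", R)
next_from_p3 = step("p3", R)
```

With $R = [0\ 1\ 0\ 1]^T$ the token moves from $p_2$ to $p_3$ because $t_2$ is true, and stays in $p_3$ because $t_3$ is false, as in the example above.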

Although the above model does not include any form of uncertainty, it is interesting to give some of the above notions a probabilistic interpretation, which will become useful in later developments. Since $m^k(p_i) = 1$ means that the token is for sure in place $p_i$, $m^k(p_i)$ may be interpreted as the probability, taking value in $\{0, 1\}$, that $X^k = p_i$. In the same way, let $p$ be an input place of transition $t_i$, and $p'$ be its output place. Then, $R^k_i$ may be interpreted as the conditional probability that $X^{k+1} = p'$, given that $X^k = p$.

2 Modeling Uncertain Knowledge of State

The belief Petri net model [1] was introduced to deal with situations in which 
the goal is to identify the state of a dynamical system at time k with partial 
or no a priori knowledge of its initial state, and/or when the transitions are 
uncertain. The structure of the net is assumed to be known, and is the same as 
considered in the previous section. Alternative approaches based on fuzzy sets 
and possibility theory are described in [6]. 

2.1 The Belief Petri Net Model 

Belief Petri nets use the Transferable Belief Model (TBM) [7], a subjectivist interpretation of Dempster-Shafer theory, to quantify one’s belief concerning the state of the system at each time step. More precisely, one’s belief regarding the value of variable $X^k$ is described by a basic belief assignment (BBA) $m^k$, i.e., a function from $2^P$ to $[0, 1]$ verifying

\[ \sum_{A \subseteq P} m^k(A) = 1. \]

For any set $A$ of places, $m^k(A)$ is interpreted as the portion of a unit mass of belief that one is willing to commit to the hypothesis that $X^k \in A$. By analogy with the standard Petri net model described above, we may define an extended marking vector $[m^k(A_1), \ldots, m^k(A_{2^n})]^T \in [0, 1]^{2^n}$, where $A_1, \ldots, A_{2^n}$ are the $2^n$ subsets of places. In this model, the marking thus takes the form of a distribution of mass on the power set of $P$.

2.2 Computation of Beliefs 

Let us now assume one’s belief regarding $X^k$ to be quantified by a BBA $m^k$, and let $R^k$ denote the vector of truth values of the transitions at time $k$, which is considered to be known. A BBA $m^{k+1}$ describing one’s belief concerning the state of the system at the next time step may be computed by transferring the mass $m^k(A)$ to the set

\[ f(A, R^k) = \bigcup_{p \in A} f(p, R^k), \tag{1} \]

for all $A \subseteq P$. We thus have

\[ m^{k+1}(B) = \sum_{\{A \subseteq P \mid f(A, R^k) = B\}} m^k(A). \tag{2} \]

Example 1 In the overtaking example, assume that

\[
\begin{aligned}
m^k(\{p_1, p_2\}) &= 0.7 \\
m^k(\{p_1, p_2, p_3\}) &= 0.2 \\
m^k(P) &= 0.1,
\end{aligned}
\]

and $R^k = [0\ 1\ 0\ 1]^T$. Using Eq. (1), we have:

\[
\begin{aligned}
f(\{p_1, p_2\}, R^k) &= \{p_1, p_3\} \\
f(\{p_1, p_2, p_3\}, R^k) &= \{p_1, p_3\} \\
f(P, R^k) &= \{p_1, p_3, p_5\}.
\end{aligned}
\]

Hence, Eq. (2) yields

\[
\begin{aligned}
m^{k+1}(\{p_1, p_3\}) &= m^k(\{p_1, p_2\}) + m^k(\{p_1, p_2, p_3\}) = 0.9 \\
m^{k+1}(\{p_1, p_3, p_5\}) &= m^k(P) = 0.1
\end{aligned}
\]
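The computation of Example 1 can be reproduced with the following sketch, which assumes that BBAs are stored as dictionaries from frozensets of place names to masses. The encoding of the net and the function names are illustrative assumptions.

```python
import math
from collections import defaultdict

# Input place -> (index of its output transition, output place)
NET = {"p1": (0, "p2"), "p2": (1, "p3"), "p3": (2, "p4"), "p4": (3, "p5")}

def step(place, truth_values):
    """f(p, R) for a single place."""
    if place in NET:
        index, target = NET[place]
        if truth_values[index] == 1:
            return target
    return place

def propagate(bba, truth_values):
    """Eq. (2): transfer each mass m^k(A) to the image set f(A, R^k)."""
    new_bba = defaultdict(float)
    for subset, mass in bba.items():
        image = frozenset(step(p, truth_values) for p in subset)
        new_bba[image] += mass
    return dict(new_bba)

def nonspecificity(bba):
    """N(m) = sum of m(A) log2 |A| over non-empty focal elements."""
    return sum(mass * math.log2(len(a)) for a, mass in bba.items() if a)

P = frozenset(["p1", "p2", "p3", "p4", "p5"])
m_k = {frozenset(["p1", "p2"]): 0.7,
       frozenset(["p1", "p2", "p3"]): 0.2,
       P: 0.1}
m_next = propagate(m_k, [0, 1, 0, 1])
```

The result carries a mass of 0.9 on $\{p_1, p_3\}$ and 0.1 on $\{p_1, p_3, p_5\}$, and its nonspecificity is lower than that of the initial BBA, since each focal element is mapped to a set that is no larger than itself.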

Note that the mass of belief is concentrated on smaller subsets at time $k + 1$. This phenomenon is general, since

\[ |f(A, R^k)| \leq |A| \quad \forall A \subseteq P. \]

An immediate consequence is that $m^{k+1}$ is more informative than $m^k$ in terms of nonspecificity [8], i.e.,

\[ N(m^{k+1}) \leq N(m^k), \]

where $N(m)$ is the nonspecificity of $m$, defined by:

\[ N(m) = \sum_{A \subseteq P, A \neq \emptyset} m(A) \log_2(|A|). \]

The difference $N(m^k) - N(m^{k+1})$ may then be interpreted as a measure of the information gained by observing the system at time step $k$ (and contained in $R^k$).

3 Uncertain Knowledge of Transitions 

In the class of applications considered in this paper, the truth values of the transitions are usually deduced from sensor measurements. Because of the limited precision and reliability of sensors, the truth values of the transitions are usually not known with certainty. In the belief Petri net model, this uncertainty is represented by belief functions describing one’s uncertain knowledge of the truth value of each transition. More precisely, let $m_i^k$ denote the BBA regarding the truth value of transition $t_i$ at time $k$. Its frame of discernment is $\Omega = \{0, 1\}$. By analogy with the previous notations, the belief masses $m_i^k(A)$ for $i = 1, \ldots, q$ and $A \subseteq \Omega$ may be presented in a matrix $R^k$ of size $(q, 3)$, such that

\[ R^k_{i,1} = m_i^k(\{0\}), \quad R^k_{i,2} = m_i^k(\{1\}), \quad R^k_{i,3} = m_i^k(\{0, 1\}) \quad \forall i \in \{1, \ldots, q\}. \]

The problem is now to combine one’s knowledge of the state of the system at time $k$, represented by $m^k$, with one’s knowledge of the transitions at $k$, represented by the $m_i^k$ for $i = 1, \ldots, q$, to arrive at a BBA $m^{k+1}$ quantifying one’s belief regarding the state of the system at time $k + 1$. This problem may be solved in the TBM framework by applying the Generalized Bayes Theorem (GBT) [9], in the following way.

First of all, let us remark that one’s beliefs about $t_i$ at time $k$ may be translated into conditional beliefs about $X^{k+1}$ given $X^k$. Let $t_i$ be a transition with input place $p$ and output place $p'$. If $X^k = p$, then $X^{k+1} = p$ if $t_i = 0$, and $X^{k+1} = p'$ otherwise. Hence, one’s beliefs about $X^{k+1}$, conditionally on $X^k$ being equal to $p$, may be deduced from $m_i^k$. If $m^{k+1|k}(A|\{p\})$ denotes the mass of belief assigned to the hypothesis $X^{k+1} \in A \subseteq P$, given that $X^k = p$, we then have:

\begin{align}
m^{k+1|k}(\{p\}|\{p\}) &= m_i^k(\{0\}) \tag{3} \\
m^{k+1|k}(\{p'\}|\{p\}) &= m_i^k(\{1\}) \tag{4} \\
m^{k+1|k}(\{p, p'\}|\{p\}) &= m_i^k(\{0, 1\}). \tag{5}
\end{align}

Thus, we know how to express our beliefs on $X^{k+1}$ given that $X^k = p$.

What are now our beliefs on $X^{k+1}$ if we only know that $X^k \in A \subseteq P$? The disjunctive rule of combination [9] provides an answer to this question. In general, the disjunctive sum of two BBAs $m$ and $m'$ defined on the same frame of discernment $P$ is defined as:

\[ (m \cup m')(A) = \sum_{\{B, C \subseteq P \mid B \cup C = A\}} m(B)\, m'(C) \quad \forall A \subseteq P. \]

This rule is appropriate to combine information coming from distinct sources, of which at least one is known to tell the truth. Here, $m^{k+1|k}(\cdot|A)$ may be obtained as the disjunctive sum of the $m^{k+1|k}(\cdot|\{p\})$ for all $p \in A$:

\[ m^{k+1|k}(\cdot|A) = \bigcup_{p \in A} m^{k+1|k}(\cdot|\{p\}) \quad \forall A \subseteq P. \tag{6} \]

Finally, the GBT makes it possible to combine one’s beliefs regarding $X^k$ with one’s beliefs regarding $X^{k+1}$ conditionally on $X^k$, as

\[ m^{k+1}(A) = \sum_{B \subseteq P} m^k(B)\, m^{k+1|k}(A|B) \quad \forall A \subseteq P. \tag{7} \]

Example 2 In the overtaking maneuver example, assume $m^k$ to be defined as in Example 1, and $R^k$ to be:

\[
R^k = \begin{bmatrix}
0.9 & 0.0 & 0.1 \\
0.0 & 0.3 & 0.7 \\
0.7 & 0.0 & 0.3 \\
0.0 & 0.3 & 0.7
\end{bmatrix}.
\]

These numbers may be translated into the following conditional belief numbers:

$m^{k+1|k}(\{p_1\}|\{p_1\}) = m_1^k(\{0\}) = 0.9$

Michele Rombaut, Iman Jarkass, and Thierry Denoeux 



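$m^{k+1|k}(\{p_1, p_2\}|\{p_1\}) = m_1^k(\{0, 1\}) = 0.1$
$m^{k+1|k}(\{p_3\}|\{p_2\}) = m_2^k(\{1\}) = 0.3$
$m^{k+1|k}(\{p_2, p_3\}|\{p_2\}) = m_2^k(\{0, 1\}) = 0.7$
$m^{k+1|k}(\{p_3\}|\{p_3\}) = m_3^k(\{0\}) = 0.7$
$m^{k+1|k}(\{p_3, p_4\}|\{p_3\}) = m_3^k(\{0, 1\}) = 0.3$
$m^{k+1|k}(\{p_5\}|\{p_4\}) = m_4^k(\{1\}) = 0.3$
$m^{k+1|k}(\{p_4, p_5\}|\{p_4\}) = m_4^k(\{0, 1\}) = 0.7$

Additionally, as there is no transition from $p_5$, we have:

$m^{k+1|k}(\{p_5\}|\{p_5\}) = 1$.

Since $m^k$ has three focal elements, $\{p_1, p_2\}$, $\{p_1, p_2, p_3\}$ and $P$, we must compute three conditional BBAs $m^{k+1|k}(\cdot|\{p_1, p_2\})$, $m^{k+1|k}(\cdot|\{p_1, p_2, p_3\})$ and $m^{k+1|k}(\cdot|P)$. From Eq. (6), we have:

\[ m^{k+1|k}(\cdot|\{p_1, p_2\}) = m^{k+1|k}(\cdot|\{p_1\}) \cup m^{k+1|k}(\cdot|\{p_2\}), \]

which leads to:

\[
\begin{aligned}
m^{k+1|k}(\{p_1, p_3\}|\{p_1, p_2\}) &= m^{k+1|k}(\{p_1\}|\{p_1\})\, m^{k+1|k}(\{p_3\}|\{p_2\}) \\
&= 0.9 \times 0.3 = 0.27 \\
m^{k+1|k}(\{p_1, p_2, p_3\}|\{p_1, p_2\}) &= m^{k+1|k}(\{p_1, p_2\}|\{p_1\})\, m^{k+1|k}(\{p_3\}|\{p_2\}) \\
&\quad + m^{k+1|k}(\{p_1\}|\{p_1\})\, m^{k+1|k}(\{p_2, p_3\}|\{p_2\}) \\
&\quad + m^{k+1|k}(\{p_1, p_2\}|\{p_1\})\, m^{k+1|k}(\{p_2, p_3\}|\{p_2\}) \\
&= 0.1 \times 0.3 + 0.9 \times 0.7 + 0.1 \times 0.7 = 0.73
\end{aligned}
\]

Similarly,

\[ m^{k+1|k}(\cdot|\{p_1, p_2, p_3\}) = m^{k+1|k}(\cdot|\{p_1\}) \cup m^{k+1|k}(\cdot|\{p_2\}) \cup m^{k+1|k}(\cdot|\{p_3\}) \]

leads to:

\[
\begin{aligned}
m^{k+1|k}(\{p_1, p_3\}|\{p_1, p_2, p_3\}) &= 0.189 \\
m^{k+1|k}(\{p_1, p_2, p_3\}|\{p_1, p_2, p_3\}) &= 0.511 \\
m^{k+1|k}(\{p_1, p_3, p_4\}|\{p_1, p_2, p_3\}) &= 0.081 \\
m^{k+1|k}(\{p_1, p_2, p_3, p_4\}|\{p_1, p_2, p_3\}) &= 0.219.
\end{aligned}
\]

Finally,

\[ m^{k+1|k}(\cdot|P) = \bigcup_{i=1}^{5} m^{k+1|k}(\cdot|\{p_i\}), \]

and we have

\[
\begin{aligned}
m^{k+1|k}(\{p_1, p_3, p_5\}|P) &= 0.0567 \\
m^{k+1|k}(\{p_1, p_2, p_3, p_5\}|P) &= 0.1533 \\
m^{k+1|k}(\{p_1, p_3, p_4, p_5\}|P) &= 0.2133 \\
m^{k+1|k}(P|P) &= 0.5767.
\end{aligned}
\]

The BBA $m^{k+1}$ quantifying one’s beliefs about the state at $k + 1$ may finally be obtained by applying Eq. (7). For example, we have:

\[ m^{k+1}(\{p_1, p_3\}) = 0.7 \times 0.27 + 0.2 \times 0.189 = 0.2268. \]

Similarly, we obtain:

\[
\begin{aligned}
m^{k+1}(\{p_1, p_2, p_3\}) &= 0.6132 \\
m^{k+1}(\{p_1, p_3, p_4\}) &= 0.0162 \\
m^{k+1}(\{p_1, p_2, p_3, p_4\}) &= 0.0438 \\
m^{k+1}(\{p_1, p_3, p_5\}) &= 0.00567 \\
m^{k+1}(\{p_1, p_2, p_3, p_5\}) &= 0.01533 \\
m^{k+1}(\{p_1, p_3, p_4, p_5\}) &= 0.02133 \\
m^{k+1}(P) &= 0.05767
\end{aligned}
\]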
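The whole chain of Example 2 (disjunctive sums of Eq. (6) followed by Eq. (7)) can be checked with the following sketch. It takes the conditional masses listed in the example as input and assumes BBAs stored as dictionaries from frozensets to masses. The function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def fs(*places):
    return frozenset(places)

def disjunctive_sum(m1, m2):
    """(m1 u m2)(A) = sum of m1(B) m2(C) over all B, C with B u C = A."""
    out = defaultdict(float)
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            out[b | c] += mass_b * mass_c
    return dict(out)

def conditional_on_set(cond, subset):
    """Eq. (6): disjunctive sum of the conditional BBAs m(.|{p}), p in A."""
    return reduce(disjunctive_sum, (cond[p] for p in sorted(subset)))

def predict(m_k, cond):
    """Eq. (7): m^{k+1}(A) = sum_B m^k(B) m^{k+1|k}(A|B)."""
    out = defaultdict(float)
    for b, mass_b in m_k.items():
        for a, mass_a in conditional_on_set(cond, b).items():
            out[a] += mass_b * mass_a
    return dict(out)

# Conditional masses m^{k+1|k}(.|{p}) as listed in Example 2
cond = {
    "p1": {fs("p1"): 0.9, fs("p1", "p2"): 0.1},
    "p2": {fs("p3"): 0.3, fs("p2", "p3"): 0.7},
    "p3": {fs("p3"): 0.7, fs("p3", "p4"): 0.3},
    "p4": {fs("p5"): 0.3, fs("p4", "p5"): 0.7},
    "p5": {fs("p5"): 1.0},
}
P = fs("p1", "p2", "p3", "p4", "p5")
m_k = {fs("p1", "p2"): 0.7, fs("p1", "p2", "p3"): 0.2, P: 0.1}
m_next = predict(m_k, cond)
```

Only the conditional BBAs of the focal elements of $m^k$ are ever computed, so the cost grows with the number of focal elements rather than with the size of the power set of $P$.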



4 Simulation Results 

Some simulated tests have been carried out, as shown in Figure 2. In this example, the initial state is known and corresponds to place $p_1$. The belief masses $m_i^k$ associated to the transitions are represented in Figure 3.

These results show that, when the truth value of a transition is not certain, the mark changes from one place $p$ to the next $p'$ through the proposition $\{p, p'\}$ composed of the two places (see, e.g., the transition from place 2 to place 3 at time step 4). This phenomenon can be seen as some form of “gradual” transition.

5 Conclusion 

In this paper, a method has been proposed to recognize the state of a sequential system modeled by a Petri net, using uncertain observations. This method is based on the Transferable Belief Model, in which belief functions are used to represent imperfect knowledge. This approach makes it possible to deal with both partial (or total) ignorance of the initial system state, and limited precision and reliability of sensor data. The use of the Generalized Bayes Theorem drastically reduces the amount of computing time in this method, as compared to previous approaches [1].

The application of the proposed framework to the development of a driving assistance system is currently underway. Real measurements have been performed on an experimental vehicle, and the current problem is to transform numerical data into truth values of logical propositions associated to the transitions. The combined use of fuzzy logic and evidence theory might be a useful approach to this problem.

Fig. 2. Simulated example, showing the belief masses $m^k(A)$ as a function of $k$ for nine consecutive time steps. The corresponding belief masses $m_i^k$ associated to the transitions are shown in Figure 3.

Fig. 3. Belief masses $m_i^k(\{1\})$ (left) and $m_i^k(\{0\})$ (right) associated to each of the four transitions, for nine consecutive time steps.

Abstract. This paper presents a programmable logic-based agent control system that interleaves planning, plan execution and perception. In this system, a program is a collection of logical formulae describing the agent’s relationship to its environment. Two such programs for a mobile robot are described, one for navigation and one for map building, that share much of their code.

Abstract. In this paper we study methods for predicting the stock index DAX. The idea is to use the information provided by several different information sources. We consider two different types of information sources: 1. Human experts who formulate their knowledge in the form of rules, and 2. Databases of objective measurable time series of financial parameters. It is shown how to fuse these different types of knowledge by using neuro-fuzzy methods. We present experimental results that demonstrate the usefulness of these new concepts.



1 Description of the Stock Index Prediction Problem 

The recently growing dynamics of the international stock markets has made port- 
folios consisting of such assets more attractive than bonds or other investments. 
In this article, we focus on the German stock index DAX because the 30 compa- 
nies forming the index are responsible for about 70% of the turnover at the stock 
market in Frankfurt. Since the DAX is a weighted mixture of stocks it behaves 
like a real portfolio. The task is the prediction of the monthly returns (relative 
differences) of the DAX given historical data and some fuzzy rules formulated 
by experts. 



1.1 Selection of timeseries and preprocessing 

Table 1 lists a selection of financial time series used for monthly prediction of 
the DAX. These variables describe a so-called fundamental model. All 
patterns are stored at the end of each month. For optimization the available 
data set has to be divided into a training and a test set. The given inputs are 
preprocessed by computing the relative differences, i.e. 

$$\mathrm{usd}(t) = 100 \left( \frac{\mathrm{usd}_t}{\mathrm{usd}_{t-1}} - 1 \right) , \tag{1}$$

A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 363-373, 1999. 

time series                    variable
Shortrate Germany              shortrate
DM/US-Dollar                   usd
Dow Jones Index                dow
US-Treasury Bonds              bonds
Morgan Stanley-Index Europe    mseuro

Table 1. Selection of time series. As one can see, the DAX itself is not used for prediction. 



where $\mathrm{usd}_t$ represents the DM/US-Dollar exchange rate at time step $t$. The index 
$t = 2, \ldots, P$ represents the time step from which the DAX has to be predicted. 
The input vector $x(t)$ is defined by 

$$x(t) = (\mathrm{shortrate}(t), \mathrm{dow}(t), \mathrm{usd}(t), \mathrm{bonds}(t), \mathrm{mseuro}(t)) . \tag{2}$$

The target value is computed by 

$$y(t) = \mathrm{dax}(t + 1) = 100 \left( \frac{\mathrm{dax}_{t+1}}{\mathrm{dax}_t} - 1 \right) . \tag{3}$$

In this paper we restrict ourselves to five time series. In the real 
application we used more than 20 time series. 
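
The preprocessing described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `relative_differences` and `build_patterns` are hypothetical, the series are plain lists of month-end values, and the target for each pattern is taken to be the relative difference of the DAX one month ahead.

```python
def relative_differences(series):
    """Percentage change between consecutive observations: 100 * (v_t / v_{t-1} - 1)."""
    return [100.0 * (cur / prev - 1.0) for prev, cur in zip(series, series[1:])]


def build_patterns(inputs, dax):
    """Pair each input vector x(t) with the target y(t) = dax(t + 1).

    `inputs` maps a variable name to its raw monthly series (hypothetical layout);
    `dax` is the raw monthly index series of the same length.
    """
    names = sorted(inputs)
    diffs = {name: relative_differences(inputs[name]) for name in names}
    dax_diffs = relative_differences(dax)
    patterns = []
    # diffs[name][i] is the change into month i + 1; the target is the DAX
    # change one month later, i.e. dax_diffs[i + 1].
    for i in range(len(dax_diffs) - 1):
        x = tuple(diffs[name][i] for name in names)
        patterns.append((x, dax_diffs[i + 1]))
    return names, patterns
```

Note that the DAX series enters only through the target, which matches the remark in Table 1 that the DAX itself is not used as an input.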

1.2 Expert knowledge 

In finance, there are many traders (experts) who are able to express their 
experience in the form of fuzzy rules. In our example experts formulated sets of rules of 
the form 

Rule R:  IF usd(t) = increasing 
         AND bonds(t) = decreasing 
         THEN dax(t + 1) = increasing 
         WITH WEIGHTING k . 

The value $k \in [0, 1]$ represents a rule weight given by the expert. This value 
measures the “importance” of a rule. In the real application we used about 100 
rules from different experts. 



2 Formalization of the problem 

Suppose for simplicity that we have only two experts who provide rule sets $R_1$ 
and $R_2$ consisting of weighted fuzzy rules of the type described above. The data 
set is denoted by $D$. So we have three different information sources, which provide 
the information $R_1$, $R_2$, and $D$, respectively. There are several general methods 
to process this information; in our example we studied mainly three operator 
schemes: 

fuse(R, R')     fuse two rule sets R and R' 
induce(D)       induce a rule set from a given data set D              (4) 
revise(R, D)    revise a rule set R in the light of a data set D 

In the different uncertainty handling calculi we find many manifestations of 
these three schemes. The simplest way to fuse two rule sets $R$ and $R'$ is just to 
calculate their union, but in many cases other techniques are used, especially 
when we have weighted rule sets. Here the choice of good rule weights for the 
fused rule sets is a delicate task. Currently, the induction of rule sets is a hot 
topic: in the field of fuzzy systems, neural network techniques [10] or methods 
adopted from fuzzy clustering [7] are often applied. Bayesian techniques are also 
useful if information about the dependencies between rules can be provided 
(which was not the case in our example). There are also several methods known 
in the literature for revising rule sets in the light of a data set. One can cancel 
rules if they turn out to be unreliable, one can change the rule weights, one 
can delete premises of rules, etc. These techniques are also known in Bayesian 
statistics, where a given prior probability distribution is revised by additional 
evidence to a posterior distribution [1,3]. 

In our example there are several possibilities to compose fusion operators. We 
could, for example, fuse the information given by the experts and then revise this 
common knowledge by using the given data set. A shorthand notation for this 
strategy is $\mathrm{revise}(\mathrm{fuse}(R_1, R_2), D)$. With this notation, we can identify several 
fusion strategies such as 

$$\mathrm{fuse}(\mathrm{fuse}(R_1, R_2), \mathrm{induce}(D)) , \tag{5}$$

$$\mathrm{fuse}(\mathrm{revise}(R_1, D), \mathrm{revise}(R_2, D)) , \text{ and} \tag{6}$$

$$\mathrm{fuse}(R_1, \mathrm{revise}(R_2, D)) . \tag{7}$$

Other techniques are well known from applications, for instance, splitting the data 
set into two disjoint subsets $D_1$ and $D_2$, and then applying the strategy 

$$\mathrm{revise}(\mathrm{revise}(\mathrm{fuse}(R_1, R_2), D_1), D_2) . \tag{8}$$

In many cases, the given expert rules are used to clean the data. 

For testing the different fusion strategies we chose neuro-fuzzy methods. 
These techniques turned out to be adequate for representing the knowledge given 
by the experts, and they also served as a convenient platform for expressing and 
testing the different operator schemes and several fusion strategies. Of course 
it is also possible to use other techniques, but the main advantage for us was 
the simplicity of the methods as well as the availability of software tools that 
enabled us to test our ideas. For a conceptual and comparative discussion of 
fusion strategies in various calculi of uncertainty modelling, we refer to [4]. 
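
To make the composition of operator schemes concrete, the following Python sketch treats a rule set as a mapping from rules to weights. Everything beyond the scheme names is an assumption made for illustration: `fuse` takes the union and keeps the larger weight when both experts state the same rule, and `revise` cancels rules whose caller-supplied `reliability` score on the data falls below a threshold. Neither is claimed to be the operator used in the experiments reported here.

```python
from typing import Callable, Dict, Hashable, Sequence

RuleSet = Dict[Hashable, float]  # rule -> weight in [0, 1] (hypothetical encoding)


def fuse(r1: RuleSet, r2: RuleSet) -> RuleSet:
    """Union of two rule sets; a shared rule keeps the larger weight (assumption)."""
    fused = dict(r1)
    for rule, weight in r2.items():
        fused[rule] = max(weight, fused.get(rule, 0.0))
    return fused


def revise(rules: RuleSet, data: Sequence,
           reliability: Callable[[Hashable, Sequence], float],
           threshold: float = 0.5) -> RuleSet:
    """Cancel rules that turn out to be unreliable on the data (assumption)."""
    return {rule: w for rule, w in rules.items()
            if reliability(rule, data) >= threshold}


def revise_fused(r1, r2, data, reliability):
    """Strategy revise(fuse(R1, R2), D)."""
    return revise(fuse(r1, r2), data, reliability)


def fuse_revised(r1, r2, data, reliability):
    """Strategy fuse(revise(R1, D), revise(R2, D))."""
    return fuse(revise(r1, data, reliability), revise(r2, data, reliability))
```

Because the schemes are ordinary functions, the other strategies listed above are obtained by nesting the calls in a different order.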

3 A Neuro-Fuzzy Solution 

In this section we briefly describe how neuro-fuzzy methods can be used to fuse 
data and expert knowledge that is formulated as a set of rules as described 
before. The presented methods are the result of intensive research on time 
series prediction by a group at Siemens. 

The linguistic expressions, e.g. increasing and decreasing, are transformed 
into fuzzy sets using one-dimensional membership functions, e.g. Gaussian or 
logistic functions: 

$$\mu_{c,s}(x) = \exp\left( -\tfrac{1}{2} \left( s (x - c) \right)^2 \right) \quad \text{(Gaussian)} ,$$

$$\mu_{c,s}(x) = \left( 1 + \exp\left( -4 s (x - c) \right) \right)^{-1} \quad \text{(logistic)} . \tag{9}$$

The location of a membership function is determined by the parameter $c$, which 
represents the mean of the Gaussian or the turning point of the logistic 
function, respectively. The width of a membership function depends on $s$, which is the inverse 
of the deviation of a Gaussian membership function, or is proportional to 
the slope at the turning point of the logistic function. Figure 1 shows a fuzzy 
partition with three fuzzy sets. 
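
The two membership functions can be sketched in Python as follows. The sketch assumes the Gaussian has the form exp(-(s(x - c))^2 / 2), with s the inverse deviation, and the logistic has the form 1 / (1 + exp(-4s(x - c))), so that s equals the slope at the turning point. The three-set partition and its parameter values are hypothetical and only illustrate how a Gaussian "stable" set can be flanked by two logistic sets.

```python
import math


def gaussian_membership(x, c, s):
    """Gaussian membership centred at c; s is the inverse of the deviation."""
    return math.exp(-0.5 * (s * (x - c)) ** 2)


def logistic_membership(x, c, s):
    """Logistic membership with turning point c; its slope at c equals s."""
    return 1.0 / (1.0 + math.exp(-4.0 * s * (x - c)))


def fuzzy_partition(x, width=1.0):
    """Hypothetical partition into three fuzzy sets (parameter values are illustrative)."""
    return {
        "decreasing": logistic_membership(x, -width, -1.0 / width),
        "stable": gaussian_membership(x, 0.0, 1.0 / width),
        "increasing": logistic_membership(x, width, 1.0 / width),
    }
```

A negative s mirrors the logistic function, which is how the "decreasing" set is obtained here.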




Fig. 1. Gaussian membership function in combination with logistic membership func- 
tions. 



Like Tresp et al. [16], Takagi and Sugeno [15], and others [17], we transform a 
rule base consisting of a finite number of fuzzy rules $R_1, \ldots, R_r$ into a neural 
network of the form described in figure 2. Each rule $R_i$ corresponds to a basis 
function $\mu_i$ (i.e. a neuron) in the second hidden layer. The output $y$ for a given 
input vector $x = (x_1, \ldots, x_n)$ is the weighted average of the conclusions $w_i$, where 
the weighting factor depends on the activity level of the rules or basis functions 
$\mu_i$ due to the input $x$, 

$$y = \mathrm{NN}(x) = \sum_{i=1}^{r} w_i \, \frac{\mu_i(x)}{\sum_{j=1}^{r} \mu_j(x)} . \tag{10}$$

The rule activity is computed by 

W : [0, 1]” ^ [0, !]'■ ,Mx) = , (11) 

where Di represents the number of linguistic terms used in rule Ri [13]. The term 
represents the membership function belonging to the premise j (linguistic 
expression) and input a;,. 
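
A minimal sketch of the forward pass may help here. It assumes, as is common for networks of this kind, that the activity of a rule is the product of the membership degrees of its premises scaled by the rule weight k, and that the output is the activity-weighted average of the conclusions. The dictionary layout of a rule and the names `rule_activity` and `network_output` are hypothetical.

```python
def rule_activity(rule, x):
    """Rule weight times the product of the membership degrees of the premises."""
    activity = rule["weight"]
    for variable, membership in rule["premises"]:
        activity *= membership(x[variable])
    return activity


def network_output(rules, x):
    """Weighted average of the rule conclusions, weighted by rule activity."""
    activities = [rule_activity(rule, x) for rule in rules]
    total = sum(activities)
    if total == 0.0:
        raise ValueError("no rule is active for this input")
    weighted = sum(a * rule["conclusion"] for a, rule in zip(activities, rules))
    return weighted / total
```

A rule with an empty premise list has an activity equal to its weight for every input, so it behaves like a default rule that dominates only when no other rule fires strongly.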

The adaptable parameters of the presented neuro-fuzzy model are the parameters 
of the fuzzy sets $c$ and $s$, the rule weights $k$ and the conclusion weights $w$. 
Additionally, the structure of the rules can be improved using data. The structure 
is represented by the weights of the connections between the first and the second 
hidden layer (see figure 2). The parameters of the fuzzy sets are stored in the 
weights between the input and the first hidden layer. The conclusion weights are 
represented by the weights of the connections between the second hidden layer 
and the output layer. The rule weights $k$ are positive real numbers with their 
sum being equal to some a priori fixed constant. In this way unimportant or 
inconsistent rules are pruned automatically as the learning algorithm drives $k$ of 
such rules to zero [11,16]. The detailed optimization procedure is described in 
[13]. Note that we used a constrained learning algorithm. Our technique preserves 
the initial semantics given by the relations among the fuzzy expressions in the rule 
base. The semantic-preserving learning algorithm ensures that reinterpretation 
after training is always possible and can give useful insights for an improved 
system understanding. In addition, the constraints reduce the effective number of 
parameters, which can avoid overfitting. Because of the low parameter-data ratio 
(simple network structure) and the constrained learning algorithm, a neuro-fuzzy 
system generates a better out-of-sample behavior than a comparable neural 
network [2]. 

4 Results 

After transforming the given rules into the neuro-fuzzy architecture, the rule 
weights and the parameters determining the shape of the fuzzy sets are 
optimized using the available data. After each training cycle the parameters are 
adapted in a direction computed by the learning algorithm. Before the next 
training cycle starts we have to ensure that the user-given constraints are 
fulfilled. If the constraints cannot all be satisfied within a given number of iterations, a 
warning is given to the user that the rule base is not consistent with the data. 
Then one can change the rules or release some constraints. In our experiments 
the algorithm rarely needed more than one iteration and always converged. 

Fig. 2. The neuro-fuzzy architecture. 



For the experiments, we used the neural network simulator SENN (Software 
environment for neural networks) of Siemens, because it integrates a neuro- 
fuzzy-simulator and all needed methods like the semantic-preserving learning 
algorithm. For more information see http://www.senn.sni.de. 

The concept of neuro-fuzzy-methods is also described in [8, 10]. 

The benchmarks used to compare the performance of the generated 
models are the following trading systems and models: 

1. Buy&Hold: buy the DAX at the beginning of the test set and sell it at the 
end. This strategy assumes an efficient capital market which does not allow 
excess return because the conditional expectation of the returns is zero. 
The Buy&Hold strategy gains only by exploiting the market trend. 

2. naive prediction: buy or hold the DAX if the last difference is positive and 
sell otherwise. The naive prediction assumes that the market behaves like a 
random walk. In the long run the naive prediction has an expected return 
of zero. 

3. linear model: a linear regression model using the given input vector. 

A comparison with a sophisticated model created by a special neural network 
[12] can be found in [14]. It turns out that the results are similar in terms of 
accuracy, but the neuro-fuzzy approach is still interpretable. For each 
pattern only a small set of rules is active and the user is able to “understand” 
the output of the network. 

The performance of the strategies is compared by computing different 
measures, which are described in the following. 



Return on Invest: The return on invest (roi) measures the return of a trading 
system by 

$$\mathrm{roi} = \sum_{t=1}^{P} \left( \frac{\mathrm{dax}_{t+1}}{\mathrm{dax}_t} - 1 \right) \mathrm{sign}(\mathrm{NN}(x(t))) , \tag{12}$$

where $\mathrm{NN}(x(t))$ is the output of the network for time step $t$ and $\mathrm{sign}(\mathrm{NN}(x(t)))$ 
is the predicted direction of the DAX. roi increases if the predicted direction is 
correct; otherwise it decreases. The annualized return on invest is computed by 

$$\mathrm{roi}_a = \frac{12}{P} \, \mathrm{roi} . \tag{13}$$

Hitrate: The hit rate (HR) measures how often the sign of the change of the 
DAX (increase or decrease) is correctly predicted: 

$$\mathrm{HR} = \frac{1}{P} \sum_{t=1}^{P} \mathrm{sign}(\mathrm{dax}_{t+1} - \mathrm{dax}_t) \, \mathrm{sign}(\mathrm{NN}(x(t))) . \tag{14}$$



Root mean squared error: The root mean squared error $E_{\mathrm{rmse}}$ computes 
the difference between the target and the output of a model, where high and low 
errors are weighted equally: 

$$E_{\mathrm{rmse}} = \sqrt{ \frac{1}{P} \sum_{t=1}^{P} \left( \mathrm{NN}(x(t), w) - y(t) \right)^2 } . \tag{15}$$

The following neuro-fuzzy-models are evaluated: 

- iniExp: initial expert with identical rule weights. 
- traink: expert with optimized rule weights. 
- trainPk: expert after optimization of fuzzy sets and rule weights. 
- prunek: expert after deletion of some rules. 

Each model represents one step in the optimization procedure. The last model 
is the optimized model. 
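
The performance measures of this section can be sketched in Python as below. Assumptions, stated plainly: `dax` is a list of index levels with one more entry than `predictions`, each prediction is the model output for the corresponding month, and `hit_rate` counts the fraction of months in which the predicted sign matches the realised sign, following the prose definition of the hit rate. All function names are hypothetical.

```python
import math


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def roi(dax, predictions):
    """Sum of monthly returns, each signed by the predicted direction."""
    return sum((dax[t + 1] / dax[t] - 1.0) * sign(predictions[t])
               for t in range(len(predictions)))


def annualized_roi(roi_value, n_months):
    """Scale a return accumulated over n_months to twelve months."""
    return 12.0 / n_months * roi_value


def hit_rate(dax, predictions):
    """Fraction of months whose direction of change was predicted correctly."""
    hits = sum(1 for t in range(len(predictions))
               if sign(dax[t + 1] - dax[t]) == sign(predictions[t]))
    return hits / len(predictions)


def rmse(outputs, targets):
    """Root of the mean squared difference between outputs and targets."""
    squared = [(o - y) ** 2 for o, y in zip(outputs, targets)]
    return math.sqrt(sum(squared) / len(squared))
```

Only the sign of a prediction enters `roi` and `hit_rate`, whereas `rmse` also penalises the magnitude of the error, which is why the measures can rank models differently.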

Table 2 presents the results of the different models on the training set. 
The results on the test set are displayed in table 3. The behavior of the 
output-target curve of the optimized model is shown in figure 3. At the end of the data 
set one can see a strongly increasing market and the “Crash” of the DAX. The 
behavior during this volatile market is very encouraging because the sign of the 
predicted direction is always correct. 

The optimized rule base is presented in table 4. Only nine rules are left and, 
due to the semantic-preserving learning algorithm, the fuzzy sets still represent 
the linguistic expressions in a correct way. Rule $R_3$ represents the so-called 
“Default” rule, which determines the output of the model if no other rule meets 
the actual input pattern. 

                    Buy&Hold    naive   linear   iniExp   traink   trainPk   prunek
Ermse                   -        7.31     3.91     5.15     4.22      2.74     2.56
roi                   19.90    -31.91    57.02    27.97    78.23     84.17    84.17
roia                   9.55    -15.32    26.32    12.91    37.54     40.40    40.40
(label illegible)      0.21     -0.35     0.61     0.30     0.85      0.91     0.91
HR                     0.64      0.40     0.65     0.62     0.80      0.84     0.84

Table 2. Evaluation: Buy&Hold, naive prediction, linear model, initial fuzzy expert 
and optimized neuro-fuzzy-models on training set. 





                    Buy&Hold    naive         linear        iniExp   traink   trainPk   prunek
Ermse                   -        9.38          6.31          5.84     5.16      5.15     4.88
roi                   77.23   [illegible]     51.42         85.72   104.53     87.80   115.13
roia                  37.07   [illegible]     24.68         41.14    50.18     42.14    55.26
(label illegible)      0.60   [illegible]   [illegible]      0.66     0.80      0.68     0.89
HR                     0.88   [illegible]   [illegible]      0.88     0.92      0.80     0.92

Table 3. Evaluation: Buy&Hold, naive prediction, linear model, initial fuzzy expert 
and optimized neuro-fuzzy-models on test set. 



Fig. 3. Comparison of output and target of the optimized model after pruning rules 
(prunek). Panel title: NEFUPSS-KO (prunek), Comparison Output/Target. 

no   Rule                                                                            weight
R1   IF shortrate = increasing AND usd = stable THEN dax = decrease                  1.0000
R2   IF usd = stable AND bonds = decreasing AND mseuro = increasing
     THEN dax = increase                                                             0.5034
R3   IF (empty premise) THEN dax = stable                                            1.0000
R4   IF dow = increasing AND usd = decreasing THEN dax = decrease                    0.4900
R5   IF bonds = stable THEN dax = increase                                           0.3417
R6   IF mseuro = increasing AND shortrate = stable AND dow = increasing
     THEN dax = decrease                                                             0.3096
R7   IF mseuro = increasing AND shortrate = decreasing THEN dax = increase           0.4369
R8   IF dow = increasing AND shortrate = increasing THEN dax = decrease              0.4911
R9   IF dow = stable AND bonds = decreasing THEN dax = increase                      0.3543

Table 4. Optimized rulebase. 

5 Summary 

We have presented an effective way to fuse data and rule knowledge. Rules 
formulated by an expert using linguistic expressions are transformed into a special 
neural network. The parameters of the neural network represent the linguistic 
terms and are optimized using the available historical data. Due to the 
semantic-preserving learning algorithm, the fuzzy sets and rules still represent the 
linguistic expressions in a correct way. Therefore an analysis of the optimized rule 
base can give useful insights into the modelled process. Because of the 
constraints needed for semantic preservation, the overfitting effect is reduced and the 
performance on unknown data is improved. 

As an example we predict the monthly returns of the German Stock Index 
DAX. The resulting model consists of only nine rules and the performance is 
very convincing. A detailed description of the neuro-fuzzy concept is presented 
in [13]. Other studies of financial applications can be found in e.g. [12,18,5, 
6,9]. The advantage of the presented neuro-fuzzy approach is the possibility to 
integrate and extract information without loss of performance. 

The presented methods are part of the neural network simulator SENN of 
Siemens. Successful applications are the prediction of the German Stock Index 
DAX and the prediction of the DM/US-Dollar exchange rate. Additionally, the 
neuro-fuzzy concept is used for the detection of errors in telecommunication. 

Abstract. In this paper we study defeasible goals in Gabbay’s labelled 
deductive systems. We prove the completeness of a simple and elegant 
proof theory for the labelled logic of defeasible goals by proving two 
phasing theorems. 



1 Introduction 

In the usual approaches to planning in AI, a planning agent is provided with 
a description of some state of affairs, a goal state, and charged with the task 
of discovering (or performing) some sequence of actions to achieve that goal. 
Recently several logics for goals (and the closely related desires) have been pro- 
posed [4,3,12,2,14,13,8,6,15,16]. We are interested in a logical system that 
tells us which conditional goals can be derived from a set of conditional goals 
(the set of premises) called the goal base. In this simple setting - in which we do 
not consider logical connectives between the conditional goals - reasoning about 
goals is non-trivial for the following three reasons. 

Goals can conflict. Some students desire to do research and desire to be rich, 
although they realize that they cannot achieve both goals simultaneously. 
Goals are context dependent. If the robot’s goal to get me some coffee can- 
not be achieved, for example because the coffee machine is broken, then its 
alternative goal may be to get me some tea. In preference-based approaches 
this goal is a preference of tea without coffee over neither tea nor coffee. 
Goals can be overridden by other goals. Bacchus and Grove’s [1] surgery 
and marriage examples below illustrate that the addition of a goal to the 
goal base may result in new goals coming into force and, more importantly, 
old goals losing their force. 

Example 1: surgery. A person may prefer not having surgery over having 
surgery, but this preference might be reversed in the circumstances where 
surgery improves one’s long term health. We can defeasibly infer that the 
person prefers no surgery only as long as it is not known that surgery im- 
proves his or her long term health. 

Example 2: marriage. A girl named Sue prefers to be married to John over 
not, she prefers to be married to Fred over not, and at the same time she 
reasonably prefers to be married to neither over being married to both. We 
can defeasibly infer that she likes to be married to both from the first two 
preferences, but not from all three together. 

The conditional goal ‘$\alpha$ is preferred (or desired) if $\beta$’ is represented by the 
formula $G(\alpha \mid \beta)$. Desirable properties of the logic of conditional goals are 
strengthening of the antecedent to deal with irrelevant circumstances, weakening of the 
consequent to support reasoning about sub-goals, the conjunction rule for the 
consequent and cumulative transitivity (or cut) to combine goals, and the 
disjunction rule for the antecedent to support reasoning by cases. Moreover, in 
contrast to most conditional and counterfactual logics the identity $G(\alpha \mid \alpha)$ is an 
undesirable property, because if $\alpha$ is the case then it does not follow that there 
is a goal for $\alpha$. The examples above illustrate that intuitively most proof rules of 
the logic of defeasible goals - i.e. goals which can be overridden by other goals - 
only hold in a restricted sense. This is in line with negative results obtained 
in the logic of preference, where counterexamples have been given to nearly all 
proposed rules [11]. In semantic terms the problem of the logic of preference 
is that there is no straightforward way to lift a preference relation between 
worlds, for example represented by a utility function, to a preference relation 
between sets of worlds or propositions. Pearl [12] also observed counterexamples 
to most proof rules when utilities and probabilities are combined in a so-called 
decision-theoretic logic of preference. We study restricted applicability with a 
global consistency constraint on proofs and with a local consistency constraint 
with phasing, as explained below. The main result of this paper is that the two 
approaches are equivalent. 

First, the restricted applicability of proof rules is enforced by a consistency 
constraint on proofs [10]. For example, if we ignore complications raised by 
reasoning by cases and overriding, then a proof is only valid if the set of the 
fulfillments of the goals at the nodes of its proof tree is consistent. Here a proof is a 
proof tree with - due to the restricted language - only conditional goals at its 
nodes, and the fulfillment of the goal $G(\alpha \mid \beta)$ is the formula $\alpha \wedge \beta$. This is a global 
constraint on proofs in the sense that it cannot be checked locally at each 
derivation step whether the constraint is violated or not. The consistency constraint 
blocks derivations from conflicting goals (consistency check on consequents), as 
well as derivations from goals with conflicting contexts (consistency check on 
antecedents). Moreover, in the logic of defeasible goals introduced in this paper the 
consistency check not only covers the premises a goal is derived from, but also 
other goals of the goal base it depends on. To formalize this dependence on the 
goal base we define an inference relation relative to the goal base, as is explained 
in detail later. With this extension we cover not only conflicts and context 
dependence, but also the overriding of the surgery and marriage examples. 

Second, the restricted applicability of proof rules is enforced by restricting 
the order of application of the proof rules, together with a local consistency 
constraint at each derivation step. If we ignore the local consistency constraint 
and replacements by logical equivalents, then the proof theory is as follows. The 
goal $G(\alpha \mid \beta_1 \wedge \beta_2)$ is an atomic goal of goal base $B$ if there is a goal $G(\alpha \mid \beta_1) \in B$, 
the goal $G(\alpha_1 \wedge \ldots \wedge \alpha_n \mid \beta)$ is a compound goal of $B$ if there is a set of atomic 
goals $\{G(\alpha_1 \mid \beta), G(\alpha_2 \mid \beta \wedge \alpha_1), \ldots, G(\alpha_n \mid \beta \wedge \alpha_1 \wedge \ldots \wedge \alpha_{n-1})\}$ of $B$, and the goal 
$G(\alpha \mid \beta_1 \vee \ldots \vee \beta_n)$ is a combined goal of $B$ if there is a set of compound goals 
$\{G(\alpha_1 \mid \beta_1), \ldots, G(\alpha_n \mid \beta_n)\}$ of $B$ such that $\alpha_i \vdash \alpha$ for $i = 1 \ldots n$. Atomic goals are 
derived from the goal base by strengthening of the antecedent, compound goals 
are derived from atomic goals by cumulative transitivity and the conjunction 
rule, and combined goals are derived from compound goals by weakening of the 
consequent and the disjunction rule. Hence, this is the only admissible order of 
proof rule application. 

The theorems of this paper imply that a goal can be derived from the goal 
base $B$ with the global consistency constraint if and only if it is a combined 
goal of $B$. To prove these surprising theorems the global consistency constraint 
as well as the ordering of proof rules is implemented in versions of Gabbay’s 
labelled deductive systems [5]. In particular, we use and adapt the labelled logics 
introduced by Van der Torre [15, 16] and Makinson [9] for defeasible goals. The 
labelled logic of conditional goals is based on a language of formulas of the type 
$G(\alpha \mid \beta)_L$, to be read as ‘$\alpha$ is preferred (or desired) if $\beta$ in the proof context 
$L$.’ The label $L$ collects the goals used in the proof to check the consistency 
of the fulfillments, and it collects the proof rules used in the proof to check 
the admissible orders. The formalization in a labelled deductive system has two 
advantages. First, as shown in [16], we can use rewriting techniques for two 
subsequent derivation steps, and generalize them inductively to rewrite arbitrary 
proof trees. Second, as shown in this paper, for the logic of defeasible goals we 
can reuse results of the logic of non-defeasible goals by proving a formal relation 
between the two logics. 

This paper is organized as follows. In Section 2 we introduce the logic, and 
in Section 3 we illustrate the logic by several examples. In Section 4 we prove 
the relation between this new logic plldg and its predecessor pllg, which is 
the basis of the completeness proof of the phased proof theory in Section 5. 

2 Phased labelled logics of defeasible goals (plldg) 

The first labelled logic of the type considered in this paper was proposed in [17] 
for obligations in a normative context. In this rudimentary logic the label of a 
premise only contains the consequent, and the consistency check considers the 
label together with the antecedent of the potentially derived goal. Two extensions 
of this logic have been proposed to cover the parallel tracks created through 
reasoning by cases, such that $G(a \mid \top)$ can be derived from the conflicting goals 
$G(a \wedge b \mid c)$ and $G(a \wedge \neg b \mid \neg c)$. Here $\top$ stands for any tautology like $p \vee \neg p$. First, 
in [15] the so-called violation restriction is based on the material conditional 
$\beta \to \alpha$ of premises $G(\alpha \mid \beta)$ (the negation of the material conditional $\beta \to \alpha$ 
is the violation $\beta \wedge \neg\alpha$), and it is suggested that this restriction is best suited 
for normative contexts.¹ Second, in [9] the label contains a separate set of goals 
for each case. The OR rule is thought of as creating different cases or parallel 
dependency tracks. Given a node in a derivation tree, we travel up the 
derivation tree generated by the node, splitting into two cases at every application 
of the OR rule, such that one case contains one of the two premise nodes and 
the other case contains the other premise node. As far as reasoning by cases is 
concerned the consistency check can either be based on material implications 
$\beta \to \alpha$, consequents $\alpha$ only, or fulfillments $\alpha \wedge \beta$. In this paper we study the 
latter, most cautious option. 

¹ In [15] the violation restriction checks whether fulfilling a derived goal does not imply 
the violation of one of the goals it is derived from. However, to block some obvious 
counterintuitive derivations in Example 2 of that paper this should be strengthened 
to the restriction that fulfilling a derived goal does not imply the disjunction of the 
violations of the goals it is derived from. 



Definition 1 (Cases). Consider a tree-like proof of a goal (the root) from a set 
of premises (the leaves). A case or dependency tree of a proof is represented by 
a set of nodes from the tree. The set of cases of the proof can be computed with 
the following procedure. 

1. Initially, the set $C$ contains one element, the set with the root. 

2. As long as there is a non-leaf node in some set of $C$ without one of its 
parents, then either: 

- if the node is not derived by OR, then add its parents to the set; 
- otherwise create two copies of the set, one for each of its parents, and add 
this parent to its copy. 

3. Each set of nodes in $C$ represents a case of the proof. 



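For example, consider the proof tree below of the derivation of $G(a \mid \top)$ from 
$G(a \wedge b \mid c)$ and $G(a \wedge \neg b \mid \neg c)$. There are two cases, $\{G(a \wedge b \mid c), G(a \mid c), G(a \mid \top)\}$ 
and $\{G(a \wedge \neg b \mid \neg c), G(a \mid \neg c), G(a \mid \top)\}$, and the sets of fulfillments of both cases 
are consistent. 

$$\dfrac{\dfrac{G(a \wedge b \mid c)}{G(a \mid c)}\,\mathrm{WC} \qquad \dfrac{G(a \wedge \neg b \mid \neg c)}{G(a \mid \neg c)}\,\mathrm{WC}}{G(a \mid \top)}\,\mathrm{OR}$$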
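
The procedure of Definition 1 can be sketched recursively in Python. The `Node` class and the string encoding of goals are hypothetical; the "parents" of a node are the premises it was derived from. The recursion is an assumption about how to organise the computation: it splits at every OR step and otherwise combines one case from each premise, which gives the same sets as the iterative procedure.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Node:
    goal: str                          # textual form of the goal at this node
    rule: str = "premise"              # proof rule that derived the node
    parents: Tuple["Node", ...] = ()   # premises of the derivation step


def cases(node: Node) -> List[FrozenSet[str]]:
    """Return the cases of the proof rooted at `node` as sets of goals."""
    if not node.parents:
        return [frozenset({node.goal})]
    if node.rule == "OR":
        # OR splits the proof: one case per premise.
        return [case | {node.goal}
                for parent in node.parents
                for case in cases(parent)]
    # Any other rule keeps all premises in the same case.
    combined = []
    for choice in product(*(cases(parent) for parent in node.parents)):
        combined.append(frozenset({node.goal}).union(*choice))
    return combined
```

Applied to the proof tree of the example, the function returns two cases, one for each branch below the OR step.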



For non-defeasible goals [16] a consistency check ensures that a goal may only 
be derived if the set of fulfillments of goals of each of its cases is consistent. It 
must always be possible to fulfill a derived goal together with each of the goals 
it is derived from, though - to support reasoning by cases - not necessarily all 
cases at the same time. In this paper we study labelled logics of defeasible goals, 
in which goals can be overridden by other dependent goals to resolve conflicts. 
The consistency check therefore also ensures that a derived goal does not violate 
a dependent goal. Here ‘dependence’ is formalized by an explicit relation (I) and 
‘violation’ is defined as follows. 



Definition 2 (Goal violation). A goal $G(\alpha \mid \beta)$ derived from $G(\alpha' \mid \beta')$ violates 
the goal $G(\alpha_1 \mid \beta_1)$ iff 

- $\alpha' \wedge \beta' \wedge (\beta_1 \to \alpha_1)$ is classically consistent and 
- $\alpha \wedge \beta \wedge (\beta_1 \to \alpha_1)$ is classically inconsistent. 

For example, the derivation of $G(p \mid q)$ from $G(p \mid \top)$ violates the goal $G(\neg q \mid \top)$, 
because $\{p, \neg q\}$ is consistent whereas $\{p \wedge q, \neg q\}$ is not. Moreover, the derivation 
of $G(j \wedge f \mid \top)$ from $G(j \mid \top)$ and $G(f \mid \top)$ violates the goal $G(\neg(j \wedge f) \mid \top)$. 
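
Definition 2 can be checked mechanically for small propositional examples. The sketch below is an illustration only: formulas are encoded as Python functions from a truth assignment to a Boolean (a hypothetical encoding), and classical consistency is decided by enumerating all assignments, which is feasible only for a handful of atoms.

```python
from itertools import product


def consistent(formulas, atoms):
    """True if some truth assignment to `atoms` satisfies every formula."""
    for values in product([False, True], repeat=len(atoms)):
        valuation = dict(zip(atoms, values))
        if all(formula(valuation) for formula in formulas):
            return True
    return False


def violates(derived, source, other, atoms):
    """Check whether deriving `derived` from `source` violates the goal `other`.

    Each goal is a pair (consequent, antecedent) of formulas.
    """
    (alpha, beta), (alpha0, beta0), (alpha1, beta1) = derived, source, other

    def conditional(valuation):
        # Material conditional: antecedent of `other` implies its consequent.
        return (not beta1(valuation)) or alpha1(valuation)

    return (consistent([alpha0, beta0, conditional], atoms)
            and not consistent([alpha, beta, conditional], atoms))
```

For the first example above, the derived goal is encoded with consequent p and antecedent q, the source with consequent p and a tautological antecedent, and the violated goal with consequent "not q" and a tautological antecedent.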

To facilitate the definitions, we do not collect all goals of a proof tree in the 
label, but only the premises together with so-called strengthenings, where the 
strengthening $\beta_2$ derives $G(\alpha \mid \beta_1 \wedge \beta_2)$ from $G(\alpha \mid \beta_1)$ by applying the proof rule 
strengthening of the antecedent. It can easily be shown by an inductive argument 
that the set of fulfillments of premises together with strengthenings is equivalent 
to the set of fulfillments of all goals. Moreover, instead of collecting the proof 
rules explicitly we associate an integer with each proof rule called its phase, and 
only represent this integer in the label [16]. Each formula occurring as a premise 
has a label that consists of itself and phase 0. The phase of a goal is determined 
by the proof rule used to derive the goal, and the labels are used to check that 
the phase of reasoning is non-decreasing. Note that the inference relation 
is relative to the goal base, because the consistency check $R_F$ refers to it. 

Definition 3 (plldg). Let $\mathcal{L}$ be a propositional base logic. The language of PLLDG consists of the labelled dyadic goals $G(\alpha \mid \beta)_L$, with $\alpha$ and $\beta$ sentences of $\mathcal{L}$, and $L$ a pair $(F, p)$ that consists of a set of sets of strengthenings ($\mathcal{L}$ sentences) together with unlabelled goals ($F$: fulfillments), and an integer ($p$: the phase).

Let $\rho$ be a phasing function that associates with each proof rule below an integer called its phase, let a formula $G(\alpha \mid \beta)_{(\{\{G(\alpha \mid \beta)\}\}, 0)}$, where $\alpha \wedge \beta$ is consistent in $\mathcal{L}$, be called a premise of plldg, and let the (extended) goal base $B = (S, I)$ be a set of premises $S$ together with an irreflexive and symmetric relation $I$ on $S \times S$, where $I(g_1, g_2)$ should be read as 'the goal $g_1$ is independent of the goal $g_2$.' The goal base $B$ derives a goal $G(\alpha \mid \beta)_L$ in the phased labelled logic of defeasible conditional goals plldg for $\rho$ if $G(\alpha \mid \beta)_L$ can be derived from $S$ with the inference rules below, extended with the following two conditions $R_F$ and $R_P$, together written $R$.

$R_F$: $G(\alpha \mid \beta)_{(F,p)}$ may only be derived from $G(\alpha_1 \mid \beta_1)_{(F_1,p_1)}$ (and $G(\alpha_2 \mid \beta_2)_{(F_2,p_2)}$) if for each set $F_i \in F$, $G(\alpha \mid \beta)$ does not violate a goal that depends on some element of $F_i$, i.e. for which $I(\cdot,\cdot)$ does not hold. Any rule is applicable if the fact that the labels of the premises are consistent with each goal of the base implies that the label of the consequence is consistent with each goal of the base.

$R_P$: $G(\alpha \mid \beta)_{(F,p)}$ may only be derived if $p \geq p_1$ (and $p \geq p_2$) for the goals $G(\alpha_1 \mid \beta_1)_{(F_1,p_1)}$ (and $G(\alpha_2 \mid \beta_2)_{(F_2,p_2)}$) it is derived from. A rule is applicable only if no rule of a higher phase has been applied before.

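The inference rules of plldg are replacements by logical equivalents and the following four rules.

\[ \mathrm{SA}:\ \frac{G(\alpha \mid \beta_1)_{(F,p)},\ R}{G(\alpha \mid \beta_1 \wedge \beta_2)_{(F \times \{\{\beta_2\}\},\ \rho(\mathrm{SA}))}} \]

\[ \mathrm{CTA}:\ \frac{G(\alpha \mid \beta \wedge \gamma)_{(F_1,p_1)},\ G(\beta \mid \gamma)_{(F_2,p_2)},\ R}{G(\alpha \wedge \beta \mid \gamma)_{(F_1 \times F_2,\ \rho(\mathrm{CTA}))}} \]

\[ \mathrm{WC}:\ \frac{G(\alpha_1 \mid \beta)_{(F,p)},\ R}{G(\alpha_1 \vee \alpha_2 \mid \beta)_{(F,\ \rho(\mathrm{WC}))}} \]

\[ \mathrm{OR}:\ \frac{G(\alpha \mid \beta_1)_{(F_1,p_1)},\ G(\alpha \mid \beta_2)_{(F_2,p_2)},\ R}{G(\alpha \mid \beta_1 \vee \beta_2)_{(F_1 \cup F_2,\ \rho(\mathrm{OR}))}} \]

where the product is defined by

\[ \{S_1, \ldots, S_n\} \times \{T_1, \ldots, T_m\} = \{S_1 \cup T_1, \ldots, S_1 \cup T_m, \ldots, S_n \cup T_m\} \]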
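The label bookkeeping of the four rules can be made concrete with a small sketch. It is only an illustration under stated assumptions: labels are modelled as frozensets of frozensets of strings, and the names `label_product` and `case` are hypothetical helpers, not notation from the paper. SA and CTA use the product defined above, OR takes the union, and WC leaves the first component of the label unchanged.

```python
from itertools import product


def label_product(first, second):
    """Return {S | T : S in first, T in second} for two sets of sets."""
    return frozenset(s | t for s, t in product(first, second))


def case(*items):
    """Build a label component containing a single case with the given items."""
    return frozenset({frozenset(items)})


# Two premises, each labelled with itself (phase omitted in this sketch).
premise_john = case("G(j|T)")
premise_fred = case("G(f|T)")

# SA multiplies the label with {{strengthening}}.
after_sa = label_product(premise_john, case("f"))

# CTA multiplies the labels of its two premises.
after_cta = label_product(after_sa, premise_fred)

# OR takes the union of the cases instead of their product.
after_or = premise_john | premise_fred
```

Here `after_cta` contains the single case with the two premises and the strengthening f, while `after_or` keeps the two cases apart, which is why the consistency check is applied per case.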



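Let $B = (\{G(\alpha_1 \mid \beta_1), \ldots, G(\alpha_n \mid \beta_n)\}, I)$ be an unlabelled goal base. We write $B \vdash_{\mathrm{PLLDG}} G(\alpha \mid \beta)$ if and only if there is a labelled goal $G(\alpha \mid \beta)_L$ that can be derived from the set of goals $\{G(\alpha_1 \mid \beta_1)_{(\{\{\alpha_1 \wedge \beta_1\}\}, 0)}, \ldots, G(\alpha_n \mid \beta_n)_{(\{\{\alpha_n \wedge \beta_n\}\}, 0)}\}$.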

Moreover, we write B\^ ^^^^^_G{a\P) if and only if B G{a\P). 

Generalizing results from [16] in an obvious way we can show that the logic 
derives restricted versions of cumulative transitivity (ct) and the conjunction 
rule for the consequent (and) below. Moreover, under certain circumstances (for 
example when all rules are applicable in the same phase) we can equivalently 
replace CTA by these two rules. 



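\[ \mathrm{CT}:\ \frac{G(\alpha \mid \beta \wedge \gamma),\ G(\beta \mid \gamma)}{G(\alpha \mid \gamma)} \qquad \mathrm{AND}:\ \frac{G(\alpha \mid \gamma),\ G(\beta \mid \gamma)}{G(\alpha \wedge \beta \mid \gamma)} \]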



In Section 4 we prove that we can recapture two previously proposed systems in our framework (because goals depend on themselves, i.e. $I$ is irreflexive):

- The phased labelled logic of conditional goals pllg [16] is the plldg in which each goal depends only on itself;

- The labelled logic of conditional goals llg [15] is the pllg with phasing function $\rho(x) = 1$.



However, first we use the logic to illustrate the three reasons why reasoning 
about goals is non-trivial. We focus on the consistency check on the fulfillments. 



3 Examples 



The first example concerns conflicts. The logic plldg can reason about conflicting goals, because we have $G(p \mid \top), G(\neg p \mid \top) \nvdash_{\mathrm{PLLDG}} G(p \wedge \neg p \mid \top)$ as well as $G(p \mid \top), G(\neg p \mid \top) \nvdash_{\mathrm{PLLDG}} G(q \mid \top)$ if $q$ is not logically implied by $p$ or $\neg p$. In particular, for any phasing function $\rho$ and independence relation $I$ the consistency check blocks the second derivation step in the following counterintuitive derivation. In the derivation below, as well as in following ones, the first blocked step is represented by a dashed line.



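  premise:              $G(p \mid \top)_{(\{\{G(p \mid \top)\}\},\ 0)}$
  WC:                   $G(p \vee q \mid \top)_{(\{\{G(p \mid \top)\}\},\ \rho(\mathrm{WC}))}$
  SA (dashed, blocked): $G(p \vee q \mid \neg p)_{(\{\{G(p \mid \top),\ \neg p\}\},\ \rho(\mathrm{SA}))}$
  premise:              $G(\neg p \mid \top)_{(\{\{G(\neg p \mid \top)\}\},\ 0)}$
  CTA:                  $G(q \wedge \neg p \mid \top)_{(\{\{G(p \mid \top),\ \neg p,\ G(\neg p \mid \top)\}\},\ \rho(\mathrm{CTA}))}$
  WC:                   $G(q \mid \top)_{(\{\{G(p \mid \top),\ \neg p,\ G(\neg p \mid \top)\}\},\ \rho(\mathrm{WC}))}$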
The label of the goal potentially derived by SA contains besides the premise $G(p \mid \top)$ also the strengthening $\neg p$. Consequently SA cannot be applied, because it would result in a goal with contradictory fulfillment $p$ and strengthening $\neg p$. In general, it is obvious that the two premises $G(p \mid \top)$ and $G(\neg p \mid \top)$ cannot be combined in any single case, because the derived goal would contain the goals with contradictory fulfillments $p$ and $\neg p$.

The second example concerns context sensitivity. Consider the two goals 'I prefer coffee' $G(c \mid \top)$ and 'I prefer tea if there is no coffee' $G(t \mid \neg c)$. The logic PLLDG combines strengthening of the antecedent (sa) with weakening of the consequent (wc) without validating the following counterintuitive derivation of 'I prefer coffee or no tea if there is no coffee' $G(c \vee \neg t \mid \neg c)$ from the first premise.

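  premise:              $G(c \mid \top)_{(\{\{G(c \mid \top)\}\},\ 0)}$
  WC:                   $G(c \vee \neg t \mid \top)_{(\{\{G(c \mid \top)\}\},\ \rho(\mathrm{WC}))}$
  SA (dashed, blocked): $G(c \vee \neg t \mid \neg c)_{(\{\{G(c \mid \top),\ \neg c\}\},\ \rho(\mathrm{SA}))}$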

The SA rule cannot be applied, because the label of the potentially derived goal contains besides the premise $G(c \mid \top)$ also the strengthening $\neg c$. In general, we call a goal $G(\alpha \mid \beta)$ a secondary goal of the primary goal $G(\alpha_1 \mid \beta_1)$ if and only if $\beta \wedge \alpha_1$ is classically inconsistent, and a secondary goal can neither be derived from a primary one, nor can they be combined in a single case.

The third example concerns overriding. Consider the goals 'Sue prefers to marry John over not' $G(j \mid \top)$ and 'Sue prefers to marry Fred over not' $G(f \mid \top)$. The PLLDG derivation below shows that the goal 'Sue prefers to marry John and Fred' $G(j \wedge f \mid \top)$ can defeasibly be derived, because the conjunction rule can be derived from SA and CTA if $\rho(\mathrm{CTA}) \geq \rho(\mathrm{SA})$.

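  premise: $G(j \mid \top)_{(\{\{G(j \mid \top)\}\},\ 0)}$
  SA:      $G(j \mid f)_{(\{\{G(j \mid \top),\ f\}\},\ \rho(\mathrm{SA}))}$
  premise: $G(f \mid \top)_{(\{\{G(f \mid \top)\}\},\ 0)}$
  CTA:     $G(j \wedge f \mid \top)_{(\{\{G(j \mid \top),\ f,\ G(f \mid \top)\}\},\ \rho(\mathrm{CTA}))}$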

Now consider the additional goal 'Sue prefers not to be married to both John and Fred' $G(\neg(j \wedge f) \mid \top)$. If the goals are independent, i.e. $I$ contains at least $(G(j \mid \top), G(\neg(j \wedge f) \mid \top))$ and $(G(f \mid \top), G(\neg(j \wedge f) \mid \top))$, then the derivation above is still valid, because additional independent premises do not influence valid derivations. Otherwise the derivation is not valid, because the SA rule is blocked. The counterintuitive goal cannot be derived, because the fulfillment of $G(j \wedge f \mid \top)$ violates the additional premise.

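  premise:              $G(j \mid \top)_{(\{\{G(j \mid \top)\}\},\ 0)}$
  SA (dashed, blocked): $G(j \mid f)_{(\{\{G(j \mid \top),\ f\}\},\ \rho(\mathrm{SA}))}$
  premise:              $G(f \mid \top)_{(\{\{G(f \mid \top)\}\},\ 0)}$
  CTA:                  $G(j \wedge f \mid \top)_{(\{\{G(j \mid \top),\ f,\ G(f \mid \top)\}\},\ \rho(\mathrm{CTA}))}$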
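The violation test behind the blocked SA step can be sketched by brute-force truth tables. This is an illustrative reading, not the paper's definition: we assume that a goal $G(\alpha \mid \beta)$ has fulfillment $\alpha \wedge \beta$ and violation $\neg\alpha \wedge \beta$, and that a derived goal violates another goal when its fulfillment classically implies that goal's violation. Goals are represented as pairs of Python predicates over truth assignments; all function names are hypothetical.

```python
from itertools import product


def assignments(atoms):
    """Enumerate all truth assignments over the given atoms."""
    for values in product([True, False], repeat=len(atoms)):
        yield dict(zip(atoms, values))


def entails(premise, conclusion, atoms):
    """Classical entailment checked by enumerating assignments."""
    return all(conclusion(m) for m in assignments(atoms) if premise(m))


def fulfillment(goal):
    alpha, beta = goal
    return lambda m: alpha(m) and beta(m)


def violation(goal):
    alpha, beta = goal
    return lambda m: (not alpha(m)) and beta(m)


def violates(derived, other, atoms):
    """Assumed reading: the fulfillment of `derived` implies the violation of `other`."""
    return entails(fulfillment(derived), violation(other), atoms)


def top(m):
    return True


atoms = ["j", "f"]
goal_j_given_f = (lambda m: m["j"], lambda m: m["f"])            # G(j | f)
goal_f = (lambda m: m["f"], top)                                   # G(f | T)
goal_not_both = (lambda m: not (m["j"] and m["f"]), top)          # G(not(j and f) | T)

blocked = violates(goal_j_given_f, goal_not_both, atoms)
harmless = violates(goal_j_given_f, goal_f, atoms)
```

Under these assumptions `blocked` is true, matching the blocked SA step when the goals are dependent, and `harmless` is false.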

Note that the fact that the conjunction rule only holds non-monotonically implies that we have to be careful whether we formalize a goal for $p \wedge q$ by the formula $G(p) \wedge G(q)$ or by $G(p \wedge q)$, where we write $G(\alpha)$ for $G(\alpha \mid \top)$. Although
we have G(p), G{p A and G{p A they 

derive different consequences when other goals are added to the goal base. For 
example with only dependent goals we have G{p) , G{q) , G{r)[^ A q)) 
and G{pA q),G{r) J(^„,„^G{r\^{p A q)), and with dependent goals we also have 
G(p),G(g),G(rhp) A and G(p A g), G(rhp)|~^__^^G(rhp A ^g). 

Hence, both representations are incomparable, in the sense that in some circum- 
stances the former is more adventurous, and in other circumstances the latter. 

Finally, consider dependent goals in $B_1 = (\{G(p), G(q)\}, \emptyset)$ and independent goals in $B_2 = (\{G(p), G(q)\}, \{(G(p), G(q)), (G(q), G(p))\})$. The fulfillment of $G(p \mid \neg q)$ violates $G(q \mid \top)$, and the former goal can therefore only be derived when the two goals are independent. We thus have $B_1 \nvdash_{\mathrm{PLLDG}} G(p \mid \neg q)$ and $B_2 \vdash_{\mathrm{PLLDG}} G(p \mid \neg q)$. In other words, we have inheritance to sub-ideal subclasses ($\neg q$) only for independent goals. Due to lack of space we cannot further discuss the independence relation and the procedure to update it when additional goals are added to the goal base.



4 The relation between plldg and pllg 



The following proposition shows that plldg derives a subset of pllg introduced 
in [16]. 

Proposition 1. Let PLLG be the plldg in which the consistency check $R_F$ is replaced by the following check $R_{F\text{-}\mathrm{PLLG}}$ on the consistency of the fulfillments of all elements of $F$.

$R_{F\text{-}\mathrm{PLLG}}$: $G(\alpha \mid \beta)_{(F,p)}$ may only be derived if the strengthenings together with the fulfillments of goals of each $F_i \in F$ are consistent.

Note that the related inference relation $\vdash_{\mathrm{PLLG}}$ of pllg is not relative to the goal base, because $R_{F\text{-}\mathrm{PLLG}}$ does not refer to it. If we have $B \vdash_{\mathrm{PLLDG}} G(\alpha \mid \beta)$ then we have $B \vdash_{\mathrm{PLLG}} G(\alpha \mid \beta)$ (but obviously not necessarily vice versa).

Proof We prove that if $R_F$ holds, then $R_{F\text{-}\mathrm{PLLG}}$ also holds. Assume $R_{F\text{-}\mathrm{PLLG}}$ does not hold. This means that the derived goal violates the goal it is derived from. Therefore $R_F$ does not hold, because each goal depends on itself ($I$ is irreflexive).

The following proposition shows how plldg can be defined as pllg with 
additional restrictions. 



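Proposition 2. Let $B = (\{G(\alpha_1 \mid \beta_1), \ldots, G(\alpha_n \mid \beta_n)\}, I)$ be an unlabelled goal base, and let the materializations of dependent goals of $G(\alpha_i \mid \beta_i)$ of $B = (S, I)$ be

\[ MDG(G(\alpha_i \mid \beta_i) \mid B) = \left\{ \{\beta_j \rightarrow \alpha_j\} \;\middle|\; \begin{array}{l} G(\alpha_j \mid \beta_j) \in S, \\ \text{not } I(G(\alpha_i \mid \beta_i), G(\alpha_j \mid \beta_j)), \\ \alpha_i \wedge \beta_i \wedge \alpha_j \wedge \beta_j \text{ is consistent} \end{array} \right\} \]

We have $B \vdash_{\mathrm{PLLDG}} G(\alpha \mid \beta)$ if and only if there is a labelled goal $G(\alpha \mid \beta)_L$ that can be derived from the set of goals

\[ \{ G(\alpha_i \mid \beta_i)_{(\{\{G(\alpha_i \mid \beta_i)\}\} \times MDG(G(\alpha_i \mid \beta_i) \mid B),\ 0)} \mid 1 \leq i \leq n \} \]

in PLLG, i.e. in plldg in which the consistency check $R_F$ is replaced by the check $R_{F\text{-}\mathrm{PLLG}}$ of Proposition 1.

Proof We prove that $R_F$ implies $R_{F\text{-}\mathrm{PLLG}}$ and vice versa. ($\Rightarrow$) Assume $R_{F\text{-}\mathrm{PLLG}}$ does not hold. This means that there is an $F_i \in F$ which is inconsistent, and this $F_i$ contains a case with an element of MDG, say $\{G(\alpha_i \mid \beta_i), \beta_j \rightarrow \alpha_j\}$. This represents exactly that the derived goal violates the goal $G(\alpha_j \mid \beta_j)$ (that depends on $G(\alpha_i \mid \beta_i)$). Consequently $R_F$ does not hold. ($\Leftarrow$) If $R_{F\text{-}\mathrm{PLLG}}$ holds then, arguing analogously, there is no violated dependent goal. Hence $R_F$ holds.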

In the following section we consider surprising consequences of this relation: 
modified versions of the important phasing theorems of pllg also hold for plldg. 

5 The one-phase logic lldg and four-phase logic 4lldg

We are interested in the following one- and four-phase logic, that represent the 
two systems discussed in the introduction with respectively a global consistency 
constraint and a local consistency constraint with phasing. 

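Definition 4 (lldg, 4lldg). The logic lldg is the plldg with the phasing function $\rho$ defined by $\rho(\mathrm{SA}) = 1$, $\rho(\mathrm{CTA}) = 1$, $\rho(\mathrm{WC}) = 1$, $\rho(\mathrm{OR}) = 1$. The logic 4lldg is the plldg with $\rho(\mathrm{SA}) = 1$, $\rho(\mathrm{CTA}) = 2$, $\rho(\mathrm{WC}) = 3$, $\rho(\mathrm{OR}) = 4$.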
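The two phasing functions and the non-decreasing phase condition can be written down directly. The sketch below is a simplification: it checks a single branch of a proof tree, given as the list of rule names applied from premise to conclusion, whereas the condition in the paper applies to every premise of every step. The dictionary and function names are hypothetical.

```python
# Phasing functions of the two logics, as given in Definition 4.
LLDG = {"SA": 1, "CTA": 1, "WC": 1, "OR": 1}
FOUR_LLDG = {"SA": 1, "CTA": 2, "WC": 3, "OR": 4}


def respects_phasing(rule_sequence, rho):
    """Check that phases never decrease along one branch of a derivation.

    Premises have phase 0, so the first rule is always admissible.
    """
    phase = 0
    for rule in rule_sequence:
        if rho[rule] < phase:
            return False
        phase = rho[rule]
    return True


# OR followed by CTA is fine in the one-phase logic but out of order in 4lldg,
# whereas SA, CTA, OR is admissible in both.
one_phase_ok = respects_phasing(["OR", "CTA"], LLDG)
four_phase_ok = respects_phasing(["OR", "CTA"], FOUR_LLDG)
rewritten_ok = respects_phasing(["SA", "CTA", "OR"], FOUR_LLDG)
```

The last two calls correspond to the kind of reordering used in the proof rewriting argument: a step sequence that is out of order for 4lldg is replaced by one whose phases are non-decreasing.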

In Theorem 1 below we show that the phasing does not restrict the set of 
derivable goals: for each lldg derivation there is an equivalent 4lldg derivation. 
This proof rewriting theorem is analogous to Theorem 1 for llg and 4llg in [16] . 
We first prove two propositions. The first says that for each case the fulfillments 
of the premises of the case together with the strengthenings imply the fulfillment 
of the derived goal. 

Proposition 3. For each goal $G(\alpha \mid \beta)_{(F,p)}$ derived in plldg we have for each $F_i \in F$ that the fulfillments of goals in $F_i$ together with the strengthenings in $F_i$ classically imply $\alpha \wedge \beta$.

Proof By induction on the structure of the proof tree. The property trivially holds for the premises, and it is easily seen that the proof rules retain the property.

The second proposition says that we can rewrite two steps of an lldg proof 
into a 4lldg proof. 

Proposition 4. We can replace two subsequent steps of an lldg derivation by an equivalent 4lldg derivation without violating a consistency constraint.

Proof In Proposition 3 in [16] this has been proven for llg and 4llg. It can be generalized for lldg and 4lldg, because plldg derivations can be considered as PLLG derivations with modified labels for premises, see Proposition 2. To illustrate this generalization, we repeat one reversal in Figure 1 below. We have to prove that if the first derivation does not violate $R_F$, then neither does the second derivation. Call the sets of fulfillments of the three premises respectively $F_1$, $F_2$ and $F_3$. The first item of the label of the first derived conditional is

\[ F_1 \times (F_2 \cup F_3) = (F_1 \times F_2) \cup (F_1 \times F_3) \]

and the first item of the label of the second derived conditional is

\[ (F_1 \times \{\gamma_1\} \times F_2) \cup (F_1 \times \{\gamma_2\} \times F_3) \]

The two sets contain equivalent formulas, because each element of $F_2$ classically implies $\gamma_1$ and each element of $F_3$ classically implies $\gamma_2$ (Proposition 3). The other proofs are analogous.



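First derivation (OR before CTA):

  OR:  from $G(\beta \mid \gamma_1)$ and $G(\beta \mid \gamma_2)$ derive $G(\beta \mid \gamma_1 \vee \gamma_2)$
  CTA: from $G(\alpha \mid \beta \wedge (\gamma_1 \vee \gamma_2))$ and $G(\beta \mid \gamma_1 \vee \gamma_2)$ derive $G(\alpha \wedge \beta \mid \gamma_1 \vee \gamma_2)$

Second derivation (SA, then CTA, then OR):

  SA:  from $G(\alpha \mid \beta \wedge (\gamma_1 \vee \gamma_2))$ derive $G(\alpha \mid \beta \wedge \gamma_1)$
  CTA: from $G(\alpha \mid \beta \wedge \gamma_1)$ and $G(\beta \mid \gamma_1)$ derive $G(\alpha \wedge \beta \mid \gamma_1)$
  SA:  from $G(\alpha \mid \beta \wedge (\gamma_1 \vee \gamma_2))$ derive $G(\alpha \mid \beta \wedge \gamma_2)$
  CTA: from $G(\alpha \mid \beta \wedge \gamma_2)$ and $G(\beta \mid \gamma_2)$ derive $G(\alpha \wedge \beta \mid \gamma_2)$
  OR:  from $G(\alpha \wedge \beta \mid \gamma_1)$ and $G(\alpha \wedge \beta \mid \gamma_2)$ derive $G(\alpha \wedge \beta \mid \gamma_1 \vee \gamma_2)$

Fig. 1. Reversing the order of OR$_4$ and CTA$_2$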



From the two previous propositions follows the following generalization of Theorem 1 of [16]. This theorem illustrates the advantage of labelled deductive systems: the rewriting of two subsequent steps can be generalized inductively to the rewriting of arbitrary proofs.

Theorem 1 (Equivalence LLDG and 4lldg). Let $B$ be a goal base. We have $B \vdash_{\mathrm{LLDG}} G(\alpha \mid \beta)$ if and only if $B \vdash_{\mathrm{4LLDG}} G(\alpha \mid \beta)$.

Proof ($\Leftarrow$) Every 4lldg derivation is an lldg derivation. ($\Rightarrow$) We can take any LLDG derivation and construct an equivalent 4lldg derivation, by iteratively replacing two subsequent steps in the wrong order by several steps in the right order, see Proposition 4.

Moreover, Theorem 2 shows that in 4lldg the global consistency check on 
the label can be replaced by a local consistency check on the conjunction of 
the antecedent and consequent of the goal. This is different from the analogous
Theorem 2 for 4llg in [16], because for non-defeasible goals a consistency check 
on phase- 1 goals is sufficient, whereas for defeasible goals we have to check 
phase-2 goals too. 

Theorem 2. Let $R_{AC}$ be the following condition.

$R_{AC}$: $G(\alpha \mid \beta)_{(F,p)}$ may only be derived from $G(\alpha_1 \mid \beta_1)$ (and $G(\alpha_2 \mid \beta_2)$) if it does not violate a dependent goal of the premise used to derive $G(\alpha_1 \mid \beta_1)$ (and $G(\alpha_2 \mid \beta_2)$).

Consider any potential derivation of 4lldg, satisfying the condition $R_P$ but not necessarily $R_F$. Then the following conditions are equivalent:

1. The derivation satisfies $R_F$ everywhere,

2. The derivation satisfies condition $R_{AC}$ throughout phase 1 and 2.

Proof Through phase 1 and 2, $F$ in the label has a unique element. The conjunction of antecedent and consequent of the derived goal is equivalent to the conjunction of fulfillments of goals of its label and the strengthenings. Hence, in phase 1 and 2 $R_F$ holds iff $R_{AC}$ holds. In phase 3 and 4 no goals can be violated.

This proves the completeness of the phased proof theory given in the introduction, because we can simply define atomic goals as phase-1 goals, compound goals as phase-2 goals, and combined goals as phase-4 goals. The goal $G(\alpha \mid \beta)$ is an atomic goal of goal base $B$ if there is a goal $G(\alpha \mid \beta') \in B$ such that $\beta$ classically implies $\beta'$, i.e. $\beta \vdash \beta'$, and $\alpha \wedge \beta$ does not violate a goal dependent on $G(\alpha \mid \beta')$. The goal $G(\alpha_1 \wedge \ldots \wedge \alpha_n \mid \beta)$ is a compound goal of $B$ if there is a set of atomic goals $\{G(\alpha_1 \mid \beta), G(\alpha_2 \mid \beta \wedge \alpha_1), \ldots, G(\alpha_n \mid \beta \wedge \alpha_1 \wedge \ldots \wedge \alpha_{n-1})\}$ of $B$ such that $\beta \wedge \alpha_1 \wedge \ldots \wedge \alpha_i$ does not violate a goal dependent on the goal used to derive $G(\alpha_{i-1} \mid \beta \wedge \alpha_1 \wedge \ldots \wedge \alpha_{i-2})$ for $i = 2 \ldots n$. The goal $G(\alpha \mid \beta_1 \vee \ldots \vee \beta_n)$ is a combined goal of $B$ if there is a set of compound goals $\{G(\alpha_1 \mid \beta_1), \ldots, G(\alpha_n \mid \beta_n)\}$ of $B$ such that $\alpha_i \vdash \alpha$ for $i = 1 \ldots n$.



6 Conclusions 

In this paper we studied a labelled logic of defeasible goals, with several desirable 
properties. The logic does not have the identity or conflict averse strategies, but it 
has restricted strengthening of the antecedent, weakening of the consequent, the 
conjunction rule, cumulative transitivity and the disjunction rule. We introduced 
phasing in the logic and we proved two phasing theorems. Phasing makes the 
proof theory more efficient, because only a single order of rule application has 
to be considered. 

Restricted applicability is of course well-known from logics of defeasible rea- 
soning, where monotony and transitivity only hold in a restricted sense and where 
restricted alternatives of monotony have been developed, e.g. cautious and ra- 
tional monotony. However, it is important to note that the logic of defaults is 
quite different from the logic of defeasible goals. For example, according to the
generally accepted Kraus-Lehmann-Magidor paradigm [7] the identity (called
supra-classicality) holds in the logic of defaults. Moreover, the conjunction rule
holds without restrictions, whereas in the logic of goals it has to be restricted 
to formalize the marriage example. This distinction is not surprising, because 
the quantitative counterparts - utilities and probabilities - also have completely 
different properties. A systematic study of these two types of non-monotonic 
reasoning is the subject of present investigations. 
1 Introduction

Computation in a number of uncertainty formalisms has recently been revolutionized by the notion of local computation. [13] and [9] showed how Bayesian probability could be efficiently propagated in a network of variables; this has already led to sizeable successful applications, as well as a large body of literature on these Bayesian networks and related issues (e.g., the majority of papers in the Uncertainty in Artificial Intelligence conferences over the last ten years).

In the late Eighties, Glenn Shafer and Prakash Shenoy [18] abstracted these ideas, leading to their Local Computation framework. Remarkably, the propagation algorithms of this general framework give rise to efficient computation in a number of spheres of reasoning: as well as Bayesian probability [16], the Local Computation framework can be applied to the calculation of Dempster-Shafer Belief [18,12], infinitesimal probability functions [21], and Zadeh's Possibility functions.

This paper shows how the framework can be used for the computation of 
logical deduction, by embedding model structures in the framework; this is de- 
scribed in section 3, and is an expanded version of work in [22]. This work can 
be applied to many of the formalisms used to represent and reason with common 
sense including predicate calculus, modal logics, possibilistic logics, probabilistic 
logics and non-monotonic logics. 

Local Computation is based on a structural decomposition of knowledge into 
a network of variables, in which there are two fundamental operations, com- 
bination and marginalization. The combination of two pieces of information is 
another piece of information which gives the combined effect; it is a little like 
conjunction in classical logic. Marginalization projects a piece of information 
relating a set of variables, onto a subset of the variables: it gives the impact of 

A. Hunter and S. Parsons (Eds.): ECSQARU’99, LNAI 1638, pp. 386-396, 1999. 

© Springer-Verlag Berlin Heidelberg 1999 
the piece of information on the smaller set of variables. Axioms are given which are sufficient for the propagation of these pieces of information in the network. General propagation algorithms have been defined, which are often efficient, depending, roughly speaking, on topological properties of the network. The reason that Local Computation can be very fast is that the propagation is expressed in terms of much smaller ('local') problems, involving only a small part of the network.

Finite sets of possibilities (or constraints) can be propagated with this framework, and so deduction in a finite propositional calculus can be performed by considering sets of possible worlds; this is implemented in, for example, PULCINELLA [15], and described formally in [17]. However, dealing with sets of possible worlds is often not computationally efficient; it is only recently [7,8] that it has been shown how to use Local Computation to directly propagate sets of formulae in a finite propositional calculus.

In the next section we introduce the Local Computation framework. We describe in section 3 how model structures can be embedded in the framework. This is applied to first-order predicate calculus in section 4. The last section discusses further applications and advantages of this approach.

2 Axioms for Local Computation 

The primitive objects in the Local Computation framework are an index set $\chi$ (often called the set of variables) and for each $r \subseteq \chi$ a set $V_r$, called the set of $r$-valuations, or, the valuations on $r$ (an $r$-valuation is intended to represent a piece of information related to the set of variables $r$). The set of valuations $V$ is defined to be $\bigcup_{r \subseteq \chi} V_r$. We assume a function $\otimes : V \times V \rightarrow V$, called combination, such that if $A \in V_r$ and $B \in V_s$ then $A \otimes B \in V_{r \cup s}$, for $r, s \subseteq \chi$. If $A$ and $B$ represent pieces of information then $A \otimes B$ is intended to represent an aggregation of the two pieces of information. We also assume that, for each $r \subseteq \chi$, there is a function $\downarrow r : \bigcup_{s \supseteq r} V_s \rightarrow V_r$, called marginalization to $r$. It is intended that $A^{\downarrow r}$ will represent what piece of information $A$ implies about variables $r$.

The framework assumes that the following axioms are verified:

Axiom LC1 (Commutativity and associativity of combination): Suppose $A$, $B$ and $C$ are valuations. Then $A \otimes B = B \otimes A$ and $A \otimes (B \otimes C) = (A \otimes B) \otimes C$.

Axiom LC2 (Consonance of marginalization): Suppose $A$ is a $t$-valuation and $r \subseteq s \subseteq t \subseteq \chi$. Then $(A^{\downarrow s})^{\downarrow r} = A^{\downarrow r}$.

Axiom LC3 (Distributivity of marginalization over combination)¹: Suppose $A$ is an $r$-valuation and $B$ is an $s$-valuation and $r \subseteq t \subseteq r \cup s \subseteq \chi$. Then $(A \otimes B)^{\downarrow t} = A \otimes B^{\downarrow s \cap t}$.

¹ LC3 is slightly stronger than the corresponding axiom A3 given in [18] (their axiom is LC3 but with the restriction that $r = t$); it turns out to be occasionally useful to have this stronger axiom.
Let $A_1, \ldots, A_n$ be valuations with, for $i = 1, \ldots, n$, $A_i \in V_{r_i}$. Many problems can be expressed as calculating $(A_1 \otimes \cdots \otimes A_n)^{\downarrow r_0}$ for some subset of variables of particular interest $r_0 \subseteq \bigcup_{i=1}^{n} r_i$; the valuation $(A_1 \otimes \cdots \otimes A_n)^{\downarrow r_0}$ will represent the impact of all the pieces of information on variables $r_0$; in Bayesian probability this computes the marginal of a joint probability distribution; we will show below how testing the consistency of a set of formulae in propositional or predicate calculus can be expressed in this way.

Elements of $V_r$ will generally be simpler objects when $r$ is small; for example they may be sets of formulae using only a small number of propositional symbols; also combination and marginalization will generally be much easier on the simpler objects (the computational complexity of these operations is typically exponential in $|r|$). Direct computation of $(A_1 \otimes \cdots \otimes A_n)^{\downarrow r_0}$ will often be infeasible as it involves a valuation in $V_r$ where $r = \bigcup_{i=1}^{n} r_i$. It can be seen that axioms LC1, LC2 and LC3 allow the computation of $(A_1 \otimes \cdots \otimes A_n)^{\downarrow r_0}$ to be broken down into a sequence of combinations and marginalizations, each within some $V_{r_i}$ (i.e., local computations), if $\mathcal{R} = \{r_0, \ldots, r_n\}$ is a hyperforest. Briefly, $\mathcal{R}$ is said to be a hyperforest if its elements can be ordered as $s_0, \ldots, s_n$ (known as a construction sequence) where, for $i = 1, \ldots, n$, there exists $k_i < i$ with $s_i \cap \bigcup_{j<i} s_j \subseteq s_{k_i}$. In other words, the intersection of $s_n$ with the union of the earlier sets is contained in one of the earlier sets, and, if we remove $s_n$, the same applies to $s_{n-1}$, and so on. Also if $\mathcal{R}$ is a hyperforest, for any $r_i \in \mathcal{R}$, we can choose the construction sequence so that $s_0 = r_i$, just as we can choose any node of a tree to be its root. The complexity of the computation will typically be roughly exponentially related to $\max_i |r_i|$. If $\mathcal{R}$ is not a hyperforest then we can perform the computations in a hyperforest $\mathcal{Q}$ which covers $\mathcal{R}$, i.e., such that for all $r \in \mathcal{R}$, there exists $s \in \mathcal{Q}$ with $s \supseteq r$. Finding a good hyperforest cover has been studied in e.g., the graph theory and statistics literature, see [9], where several efficient and useful heuristic methods have been produced.
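The construction-sequence condition is easy to test for a given ordering. The following sketch checks only whether one particular ordering is a construction sequence; it does not search for an ordering or build a hyperforest cover. The function name is hypothetical, and the first test ordering is the hyperforest cover used in the example of section 3.3.

```python
def is_construction_sequence(sets):
    """Check that each set meets the union of the earlier sets inside one earlier set."""
    for i in range(1, len(sets)):
        earlier_union = set().union(*sets[:i])
        overlap = sets[i] & earlier_union
        if not any(overlap <= sets[k] for k in range(i)):
            return False
    return True


# Ordering of the hyperforest cover used in the example of section 3.3.
cover_order = [
    {"Q", "R", "S"},
    {"P", "Q", "R"},
    {"P", "R", "T"},
    {"R", "T", "W"},
    {"T", "U", "V", "W"},
    {"T", "V", "X"},
]

# A triangle of pairwise overlapping sets: this ordering fails the condition.
triangle = [{"A", "B"}, {"B", "C"}, {"A", "C"}]

cover_is_sequence = is_construction_sequence(cover_order)
triangle_is_sequence = is_construction_sequence(triangle)
```

Because the cover forms a chain, the reversed ordering passes the same test, which reflects the remark that any element can be chosen as the first set of the sequence.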

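More precisely², to calculate $(A_1 \otimes \cdots \otimes A_n)^{\downarrow r_0}$ we choose a hyperforest cover of $\{r_0, \ldots, r_n\}$ with construction sequence $s_0, \ldots, s_m$ chosen so that $s_0 \supseteq r_0$. For $i = 1 \ldots m$, let $X_i = \bigcup_{j \leq i} s_j$. We compute the combination $A = \bigotimes \{A_j : r_j \subseteq s_m\}$. Let $\Gamma = \{j : r_j \not\subseteq s_m\}$; note that if $i \in \Gamma$ then $r_i \subseteq X_{m-1}$. By LC1, $(A_1 \otimes \cdots \otimes A_n)^{\downarrow r_0} = (A \otimes \bigotimes_{j \in \Gamma} A_j)^{\downarrow r_0}$, which by LC2 can be written as $((A \otimes \bigotimes_{j \in \Gamma} A_j)^{\downarrow X_{m-1}})^{\downarrow r_0}$, which equals $(A^{\downarrow s_m \cap X_{m-1}} \otimes \bigotimes_{j \in \Gamma} A_j)^{\downarrow r_0}$ using LC3 and commutativity of $\otimes$. We can then repeat the process, computing the combination on $s_{m-1}$ and marginalizing to $s_{m-1} \cap X_{m-2}$, and so on, until we obtain the marginal within $s_0$ which we can then marginalize to $r_0$.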

3 Similarity Model Structures 

To apply the Local Computation framework to theorem proving in a particular logic, we need to define appropriate combination and marginalization operations, and verify the axioms LC1-3 for those operations. Unfortunately, the latter (in particular the verification of LC3) tends to be far from straightforward. In this section we introduce an intermediate framework, that of Similarity Model Structures, which has a simple embedding in the Local Computation Framework. Expressing logics in terms of Similarity Model Structures can often be done very easily via their semantics. The advantage of this two-stage approach is that the verification of the Local Computation axioms tends to be much easier, because of the theorem given below.

² The reader may well find this algorithm easier to understand by first going through the example in 3.3.

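A Similarity Model Structure is defined to be a triple $(\mathcal{M}, \chi, (\approx_r)_{r \subseteq \chi})$ where $\mathcal{M}$ is a set, the elements of which are called models, $\chi$ is an indexing set, and each $\approx_r$ is an equivalence relation on $\mathcal{M}$. For this paper we will also assume the following monotonicity property: for $r \subseteq s \subseteq \chi$, $\approx_s \subseteq \approx_r$. For $M, N \in \mathcal{M}$, $r \subseteq \chi$, define $M^{\downarrow r}$ to be $\{N : N \approx_r M\}$, and for $A \subseteq \mathcal{M}$, define $A^{\downarrow r}$ to be $\bigcup_{M \in A} M^{\downarrow r}$. If $A^{\downarrow r} = A$ we say that $A$ is $r$-closed.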

3.1 Embedding Similarity Model Structures in the Local 
Computation Framework 

To embed Similarity Model Structures in the Local Computation Framework we need to define $r$-valuations and the operations Combination and Marginalization. We use the same indexing set $\chi$; for $r \subseteq \chi$, the set of $r$-valuations $V_r$ is defined to be the set of $r$-closed subsets of $\mathcal{M}$. For $A \in V_s$, we have already defined its result under $r$-marginalization, $A^{\downarrow r}$. For $A \in V_r$ and $B \in V_s$ define $A \otimes B$ to be $A \cap B$, which can be shown to be an element of $V_{r \cup s}$.

It can easily be seen that axioms LC1 and LC2 are automatically satisfied for this embedding of Similarity Model Structures, but LC3 does not always hold, and is more problematic.

A Similarity Model Structure $(\mathcal{M}, \chi, (\approx_r)_{r \subseteq \chi})$ is said to satisfy the Independence Property if

for any $r, s \subseteq \chi$ and $M, N \in \mathcal{M}$ such that $M \approx_{r \cap s} N$, there exists $L \in \mathcal{M}$ such that $L \approx_r M$ and $L \approx_s N$.

This property may be paraphrased as: knowing the $\approx_r$-equivalence class $A$ of an unknown model $L$ doesn't tell us anything about its $\approx_s$-equivalence class $B$, except that $B$ and $A$ are both subsets of the same $\approx_{r \cap s}$-equivalence class (i.e., that containing $L$).

Theorem 1. A Similarity Model Structure $(\mathcal{M}, \chi, (\approx_r)_{r \subseteq \chi})$ satisfies the Independence Property if and only if its embedding in the Local Computation framework satisfies the distributivity axiom LC3.

This result makes the checking of the Local Computation axioms much easier for many logics, since we just have to verify the Independence property, which often relates naturally to the semantics of the logic.

3.2 The Propositional Calculus 

To illustrate the frameworks we will consider the propositional calculus based on a set of propositional symbols $\chi = \{P_1, P_2, \ldots\}$. Let $\mathcal{M}$ be the set of truth functions, i.e., functions from $\chi$ to $\{T, F\}$. For $r \subseteq \chi$, define $\approx_r$ by $M \approx_r N$ iff $M$ and $N$ agree on $r$, i.e., for all $P_i \in r$, $M(P_i) = N(P_i)$. Each $\approx_r$-equivalence class corresponds to an $r$-partial model, i.e., a function from $r$ to $\{T, F\}$. Hence $r$-closed sets may be thought of as sets of $r$-partial models. Using the above embedding, marginalizing a set $A$ of $s$-partial models to $r \subseteq s$ amounts to restricting them to $r$. If $B$ is a set of $t$-partial models then $A \otimes B$ is the set of all $M \wedge N$, with $M \in A$, $N \in B$ such that $M$ and $N$ agree on $s \cap t$, where $M \wedge N$ is the $s \cup t$-partial model which agrees with $M$ on $s$ and with $N$ on $t$. The fact that such an $s \cup t$-partial model always exists implies that the Independence Property is satisfied, so the Local Computation axioms hold.
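Combination and marginalization on sets of partial models can be sketched in a few lines. The representation is an assumption made for illustration: a valuation is a pair (domain, models), where each partial model is a frozenset of (symbol, truth value) pairs, and the function names are hypothetical. The small example also checks one instance of axiom LC3, with $r = t = \{R, S\}$ and $s = \{Q, S\}$.

```python
from itertools import product


def models_of(variables, formula):
    """All partial models on `variables` that satisfy `formula` (a predicate on dicts)."""
    names = sorted(variables)
    models = set()
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if formula(assignment):
            models.add(frozenset(assignment.items()))
    return (frozenset(names), models)


def marginalize(valuation, r):
    """Restrict every partial model to the symbols in r."""
    domain, models = valuation
    r = frozenset(r) & domain
    return (r, {frozenset((v, b) for v, b in m if v in r) for m in models})


def combine(a, b):
    """Join partial models that agree on the shared symbols."""
    (s, models_a), (t, models_b) = a, b
    shared = s & t
    joined = set()
    for m in models_a:
        dm = dict(m)
        for n in models_b:
            dn = dict(n)
            if all(dm[v] == dn[v] for v in shared):
                joined.add(frozenset({**dm, **dn}.items()))
    return (s | t, joined)


# R implies S, and S implies not Q.
val_rs = models_of({"R", "S"}, lambda m: (not m["R"]) or m["S"])
val_qs = models_of({"Q", "S"}, lambda m: (not m["S"]) or (not m["Q"]))

joint = combine(val_rs, val_qs)
on_qr = marginalize(joint, {"Q", "R"})

# One instance of LC3: marginalizing the combination to {R, S}.
lhs = marginalize(joint, {"R", "S"})
rhs = combine(val_rs, marginalize(val_qs, {"S"}))
```

In this sketch `joint` holds four partial models on {Q, R, S} and `on_qr` holds the three partial models in which Q and R are not both true.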

Suppose, for $i = 1, \ldots, n$, $\Gamma_i$ is a set of formulae involving only a finite number of propositional symbols $r_i \subseteq \chi$. We can check if $\bigcup_i \Gamma_i$ is consistent or not by seeing if $\bigotimes_{i=1}^{n} [\Gamma_i]$ is non-empty (where $[\Gamma_i]$ is the set of truth functions satisfying $\Gamma_i$, which is an $r_i$-closed set). To do this we find a hyperforest cover of $\{r_i : i = 1, \ldots, n\}$ using a standard algorithm, and perform local computations with sets of partial models.

The same approach can be used for a wide range of monotonic logics for which partial models can be defined.

3.3 Example 

Suppose we are given the collection of propositional formulae $(R \wedge \neg Q) \supset \neg P$, $(T \equiv P) \vee \neg R$, $R \supset S$, $S \supset \neg Q$, $T \supset (R \vee W)$, $(T \wedge U \wedge (V \equiv W)) \vee (\neg T \wedge \neg U \wedge \neg W)$, $T \vee X$, $V \supset X$ and we want to see if we can deduce $X$ or $\neg X$. We therefore let $\chi = \{P, Q, R, S, T, U, V, W, X\}$ and consider $\mathcal{R} = \{\{P,Q,R\}, \{P,R,T\}, \{Q,S\}, \{R,S\}, \{R,T,W\}, \{T,U,V,W\}, \{T,X\}, \{V,X\}\}$, each set of $\mathcal{R}$ being the set of propositional variables contained in one of the formulae. $\mathcal{R}$ is not a hyperforest, but is fairly close to being one. $\mathcal{Q} = \{\{P,Q,R\}, \{P,R,T\}, \{Q,R,S\}, \{R,T,W\}, \{T,U,V,W\}, \{T,V,X\}\}$ is a natural hyperforest cover of $\mathcal{R}$ given by even crude algorithms for finding hyperforest covers. Since none of the sets in $\mathcal{Q}$ is large, the computation will be efficient; if we were forced to use a hyperforest cover with a large set, then the computation would probably be much less efficient. Since we want to marginalize to $\{X\}$ we choose a construction sequence ending in $\{T,V,X\}$, which in this case is unique: $\{Q,R,S\}$, $\{P,Q,R\}$, $\{P,R,T\}$, $\{R,T,W\}$, $\{T,U,V,W\}$, $\{T,V,X\}$.

We begin by considering all the formulae on S'), i.e., R A S and 

S A ~<Q, combine their respective sets of partial models by finding all the 
{Q, i?, S'j-partial models that satisfy both formulae, i.e., (T,F,F), (F,T,T), 
(F,F,T), (F,F,F) where the co-ordinates are in the order Q,R,S. We then 
marginalize this set of partial models to the intersection of S') with the 

rest of the hyperforest, that is, {Q, R} (in other words, we delete S). This gives 
the {Q, i?j-partial models (T, F), (F, T), (F, F), by dropping the last co-ordinate. 
As one would expect, these are the set of models of R A ~'Q. Note that we will 
now have no further use for the two formulae involving S. We next combine 
on {P,Q,R} and marginalize to {P, R}', as well as the {Q, i?j-partial mod- 
els just computed, this involves the partial models of {R A ~'Q) A ~'P , giv- 
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mg {P,Q,i?}-partial models (T,T,F), (T,F,F), (F,T,F), (F,F,T), (F,F,F). 
Marginalizing to {P, R] gives (T, F), (F, T), (F, F), the {P, i?}-partial models of 
the formula R D -i_P. We then combine on {P, R,T} and marginalize to {R, T}, 
deleting P, giving the set of {i?, T}-partial models of -li? V ->T. Combining on 
{R,T,W} and deleting R gives the set of partial models of T D FF; combin- 
ing on {T, U,V,W} and marginalizing to {T, V} gives -iT V V. Combining on 
{T, V,X} gives the set of {T, V, df}-partial models of ~<T\/ V, TV X and V D X , 
i.e., {(T, T, T), (F, T, T), (F, F, T)}. This incidentally shows that the formulae 
involving just variables T, V, X which are implied by our original set of formulae 
are exactly those which are true in all three of these partial models. Finally, 
marginalizing to {df} gives the singleton set {(T)} so df must be true. Flence we 
have shown that df is implied by the given set of formulae, and this set is also 
consistent (if it had been inconsistent, marginalizing to {df} would have given 
the empty set). 
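
The computation just described can be replayed mechanically. The following is a
minimal Python sketch, not the authors' implementation: it assumes a partial
model is stored as a tuple of truth values over a sorted tuple of variable
names, that combination is a join and marginalization a projection, and the
function names (models_of, combine, marginalize, run) are hypothetical.

```python
from itertools import product

def models_of(variables, formula):
    """Partial models on `variables` satisfying `formula` (a function of a dict)."""
    variables = tuple(variables)
    rows = {vals for vals in product((True, False), repeat=len(variables))
            if formula(dict(zip(variables, vals)))}
    return variables, rows

def combine(a, b):
    """Partial models on the union of both variable sets whose restrictions
    belong to both operands."""
    (va, ra), (vb, rb) = a, b
    variables = tuple(sorted(set(va) | set(vb)))
    rows = set()
    for vals in product((True, False), repeat=len(variables)):
        m = dict(zip(variables, vals))
        if tuple(m[v] for v in va) in ra and tuple(m[v] for v in vb) in rb:
            rows.add(vals)
    return variables, rows

def marginalize(a, keep):
    """Restrict every partial model to the variables in `keep`."""
    va, ra = a
    variables = tuple(v for v in va if v in keep)
    return variables, {tuple(x for x, v in zip(row, va) if v in keep) for row in ra}

def implies(p, q):
    return (not p) or q

# The formulae of the example, each paired with the variables it mentions.
formulae = [
    ("RS", lambda m: implies(m["R"], m["S"])),
    ("QS", lambda m: implies(m["S"], not m["Q"])),
    ("PQR", lambda m: implies(m["R"] and not m["Q"], not m["P"])),
    ("PRT", lambda m: (m["T"] == m["P"]) or not m["R"]),
    ("RTW", lambda m: implies(m["T"], m["R"] or m["W"])),
    ("TUVW", lambda m: (m["T"] and m["V"] and m["U"] == m["W"])
                       or (not m["T"] and not m["U"] and not m["W"])),
    ("TX", lambda m: m["T"] or m["X"]),
    ("VX", lambda m: implies(m["V"], m["X"])),
]

# (set combined on, set marginalized to), in the order used in the text.
steps = [("QRS", "QR"), ("PQR", "PR"), ("PRT", "RT"),
         ("RTW", "TW"), ("TUVW", "TV"), ("TVX", "X")]

def run(steps, formulae):
    """Combine at each set, then pass the marginal on to the next set."""
    remaining, message, messages = list(formulae), None, []
    for node, target in steps:
        current = models_of(sorted(node), lambda m: True)
        used = [f for f in remaining if set(f[0]) <= set(node)]
        remaining = [f for f in remaining if f not in used]
        for variables, formula in used:
            current = combine(current, models_of(sorted(variables), formula))
        if message is not None:
            current = combine(current, message)
        message = marginalize(current, set(target))
        messages.append(message)
    return messages

messages = run(steps, formulae)
```

Under these assumptions the last entry of messages is the marginal on {X}; an
empty set of rows there would signal inconsistency.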

Table 1 illustrates the computation. The first column shows the variables
involved in the combination at each stage. The second column gives the set of
variables to which we marginalize. The formulae still relevant to the computation
are shown in the third column, the ones involved at that stage being underlined.





Combine on    Marginalize to   Formulae still relevant

Q, R, S       Q, R             _R ⊃ S_, _S ⊃ ¬Q_, R ∧ ¬Q ⊃ ¬P, (T ≡ P) ∨ ¬R, T ⊃ R ∨ W, Φ
P, Q, R       P, R             _R ⊃ ¬Q_, _R ∧ ¬Q ⊃ ¬P_, (T ≡ P) ∨ ¬R, T ⊃ R ∨ W, Φ
P, R, T       R, T             _R ⊃ ¬P_, _(T ≡ P) ∨ ¬R_, T ⊃ R ∨ W, Φ
R, T, W       T, W             _¬R ∨ ¬T_, _T ⊃ R ∨ W_, Φ
T, U, V, W    T, V             _T ⊃ W_, _(T ∧ V ∧ (U ≡ W)) ∨ (¬T ∧ ¬U ∧ ¬W)_, T ∨ X, V ⊃ X
T, V, X       X                _¬T ∨ V_, _T ∨ X_, _V ⊃ X_
                               X

where Φ = {(T ∧ V ∧ (U ≡ W)) ∨ (¬T ∧ ¬U ∧ ¬W), T ∨ X, V ⊃ X}
Table 1. Sequence of combinations and marginalizations in the example



Note that the Local Computation axioms LC1-3 ensure that the above
procedure does indeed compute the marginal of the set of formulae to {X} (see
section 2), so soundness and completeness of this proof procedure are
automatically verified.

Although we have defined the combination and marginalization operations in
terms of sets of partial models, as the example strongly suggests, these operations
can be implemented using formulae.
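
One way to work directly on formulae, sketched below, is to put them in clausal
form and delete a variable by resolving on it (Davis-Putnam style variable
elimination). This is an assumption made for illustration; the text does not
say which syntactic procedure is intended. A literal is a hypothetical
(variable, sign) pair, a clause a frozenset of literals, and input clauses are
assumed not to be tautologies.

```python
def eliminate(clauses, var):
    """Marginalize a set of clauses by deleting `var`: keep the clauses that
    do not mention it and add every non-tautologous resolvent on it."""
    pos = [c for c in clauses if (var, True) in c]
    neg = [c for c in clauses if (var, False) in c]
    result = {c for c in clauses if c not in pos and c not in neg}
    for p in pos:
        for n in neg:
            resolvent = (p - {(var, True)}) | (n - {(var, False)})
            # Drop resolvents containing a literal and its negation.
            if not any((v, not s) in resolvent for v, s in resolvent):
                result.add(resolvent)
    return result

# First step of the example: R ⊃ S and S ⊃ ¬Q, deleting S.
step1 = eliminate({
    frozenset({("R", False), ("S", True)}),   # ¬R ∨ S
    frozenset({("S", False), ("Q", False)}),  # ¬S ∨ ¬Q
}, "S")

# Second step: add (R ∧ ¬Q) ⊃ ¬P, i.e. ¬R ∨ Q ∨ ¬P, and delete Q.
step2 = eliminate(step1 | {
    frozenset({("R", False), ("Q", True), ("P", False)}),
}, "Q")
```

Here step1 contains the single clause ¬R ∨ ¬Q (that is, R ⊃ ¬Q) and step2 the
single clause ¬R ∨ ¬P (that is, R ⊃ ¬P), matching the marginals obtained above
from the partial models.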

4 Application to first-order logic 

In order to apply Local Computation to first-order logic, we consider a logical
language ℒ, generated in the usual way from a set 𝒳 of function and predicate
symbols, a set of individual variable symbols Var, and connectives and delimiters.
For r ⊆ 𝒳, let ℒ_r be the sublanguage of ℒ comprising the formulae whose function
and predicate symbols are all in r. Let ℳ be the set of models on 𝒳:
each model M ∈ ℳ is defined by its universe U_M and, for each n-ary function
symbol f ∈ 𝒳, an n-ary function on U_M, and for each n-ary predicate symbol
P an n-ary relation on U_M. The set of models of a subset Γ of ℒ is denoted [Γ].
For each r ⊆ 𝒳 an equivalence relation ≈_r on ℳ can naturally be defined by:
M ≈_r N if and only if M and N have the same universe and give the same
interpretation to the symbols of r.

For any r, s ⊆ 𝒳 and M, N ∈ ℳ such that M ≈_{r∩s} N, let L be the model of
universe U_L = U_M = U_N which gives to each symbol in r the same interpretation
as M, and gives to each symbol in s the same interpretation as N; this model
is well-defined since M ≈_{r∩s} N, and clearly L ≈_r M and L ≈_s N. Thus the
similarity model structure (ℳ, 𝒳, (≈_r)_{r⊆𝒳}) satisfies the Independence Property,
and can be embedded in the Local Computation Framework. Notice that if we
consider the set ℳ_H of Herbrand models of ℒ, (ℳ_H, 𝒳, (≈_r)_{r⊆𝒳}) still satisfies
the Independence Property, since the model L constructed above is a Herbrand
model if M and N are.
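
The construction of L can be made concrete for finite models. The sketch below
is only illustrative: it assumes a model is a dict holding a universe and an
interpretation for each symbol, and the names similar and amalgamate are
hypothetical. Symbols outside r ∪ s are left unconstrained by the argument
above; the sketch copies them from M as an arbitrary choice.

```python
def similar(m, n, symbols):
    """M ≈_r N: same universe and same interpretation of every symbol in r."""
    return (m["universe"] == n["universe"]
            and all(m["interp"][x] == n["interp"][x] for x in symbols))

def amalgamate(m, n, r, s):
    """Build L with L ≈_r M and L ≈_s N, assuming M ≈_{r∩s} N."""
    if not similar(m, n, r & s):
        raise ValueError("M and N must agree on the symbols of r ∩ s")
    interp = dict(m["interp"])        # arbitrary choice outside r ∪ s
    for x in s:
        interp[x] = n["interp"][x]    # symbols of s are read from N
    for x in r:
        interp[x] = m["interp"][x]    # symbols of r are read from M
    return {"universe": m["universe"], "interp": interp}

# Three unary predicates over a two-element universe.
M = {"universe": frozenset({0, 1}),
     "interp": {"P": {(0,)}, "Q": {(1,)}, "R": {(0,), (1,)}}}
N = {"universe": frozenset({0, 1}),
     "interp": {"P": set(), "Q": {(1,)}, "R": {(0,)}}}
r, s = {"P", "Q"}, {"Q", "R"}
L = amalgamate(M, N, r, s)
```

In this example M and N agree on Q, the only symbol of r ∩ s, so L takes P and Q
from M and R from N.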

Suppose now that we have a family (Γ_i)_i of subsets of ℒ, each Γ_i being more
precisely a subset of some ℒ_{r_i}, with r_i ⊆ 𝒳, and that we want to check the
satisfiability of ⋃_i Γ_i. It can easily be checked that the set of models of
⋃_i Γ_i is empty if and only if its marginal [⋃_i Γ_i]^{↓∅} is empty. Performing
marginalization and combination on sets of models would often not be practical.
However it is possible to work with first-order representations of sets of
models whenever it is possible to define a function MARG such that
MARG(Γ, r) ⊆ ℒ_r and [MARG(Γ, r)] = [Γ]^{↓r}. In this case
[⋃_i Γ_i]^{↓∅} = [MARG(⋃_i Γ_i, ∅)]. The formulas in MARG(⋃_i Γ_i, ∅) do not
contain any predicate or function symbols (except possibly the equality
predicate). More importantly we can look for a hyperforest cover of
{r_i : i = 1, ..., n} using a standard algorithm, and perform local computations
of MARG on sets of formulae. In the remainder of this section, we review some
existing algorithms to compute the marginalization of sets of formulae.

Although Theorem Proving and Consistency Checking have been the focus of
much research on Automated Deduction, it has recently become clear that a more
general approach can be particularly useful, especially in Artificial Intelligence.
In this approach, known as Consequence Finding, instead of just trying to deduce
one formula, a theorem prover would be asked to find all consequences of a set
of premises that verify a given property. This approach has been formalized, in
the case of clauses without the equality predicate, using the notion of production
field [19,5]: a production field 𝒫 is defined by a set L of literals closed under
instantiation; we then write 𝒫 = ⟨L⟩. A clause C belongs to 𝒫 if every literal in C
belongs to L. Given a set of clauses Σ, [5] defines the set of characteristic clauses
of Σ with respect to 𝒫, denoted Carc(Σ, 𝒫), to be the set of clauses belonging to
𝒫 that are entailed by Σ and that are not subsumed by any other consequence
of Σ belonging to 𝒫. If we define L_r to be the set of literals whose predicate and
function symbols are all in r, it can be shown that, if [Σ]_H is the set of Herbrand
models of Σ, [Σ]_H^{↓r} = [Carc(Σ, ⟨L_r⟩)]_H. Algorithms to compute Carc can be
found in e.g. [14,5]. Notice that these algorithms will not always terminate,
since Carc(Σ, ⟨L_r⟩) may be infinite.
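
In the propositional case the characteristic clauses are finite and can be
found by brute force, which gives a feel for the definition. The sketch below
is a propositional analogue only, not one of the algorithms of [14,5]: it
assumes clauses are frozensets of hypothetical (variable, sign) literals,
saturates the input under resolution, keeps the clauses built from the
literals of the production field, and discards those properly subsumed.

```python
def is_tautology(clause):
    return any((v, not s) in clause for v, s in clause)

def resolution_closure(clauses):
    """Saturate a finite set of propositional clauses under resolution."""
    closed = set(clauses)
    changed = True
    while changed:
        changed = False
        for a in list(closed):
            for b in list(closed):
                for v, s in a:
                    if (v, not s) in b:
                        res = (a - {(v, s)}) | (b - {(v, not s)})
                        if not is_tautology(res) and res not in closed:
                            closed.add(res)
                            changed = True
    return closed

def carc(clauses, field_literals):
    """Clauses over `field_literals` entailed by `clauses` and not properly
    subsumed by another such clause (propositional case only)."""
    consequences = [c for c in resolution_closure(clauses)
                    if c <= field_literals]
    return {c for c in consequences if not any(d < c for d in consequences)}

sigma = {
    frozenset({("R", False), ("S", True)}),   # R ⊃ S
    frozenset({("S", False), ("Q", False)}),  # S ⊃ ¬Q
}
# Production field: all literals over the symbols Q and R.
field = {(v, s) for v in ("Q", "R") for s in (True, False)}
result = carc(sigma, field)
```

With this input, result holds the single clause ¬R ∨ ¬Q, which agrees with the
marginal to {Q, R} computed in the propositional example of the previous section.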

Consequence Finding is highly relevant for Automated Reasoning, and a
nonmonotonic logic that clearly illustrates the connection is Circumscription.
If we consider two disjoint sequences P and Z of predicate symbols of 𝒳, the
circumscription of P in a first-order sentence W with variable Z is defined in
[11] to be the second-order sentence

    W ∧ ¬∃p, z (W[p, z] ∧ (p ≤ P) ∧ (P ≰ p))

where p and z are two sequences of predicate variables similar¹ to P and
Z respectively, W[p, z] is the sentence obtained by replacing in W each
occurrence of a predicate constant P_i of P (resp. Z_j of Z) by the corresponding
predicate variable p_i of p (resp. z_j of z), and p ≤ P is an abbreviation for
the sentence ⋀_i ∀x_i (p_i(x_i) ⊃ P_i(x_i)). A model-theoretical characterization
of circumscription is obtained by means of an ordering ≤_{P;Z} on the set of
models: M ≤_{P;Z} N if M and N have the same universe, give the same
interpretation to all symbols that are not in P ∪ Z, and for all P_i in P,
|P_i|_M ⊆ |P_i|_N. Using the notations introduced in the previous section,
M ≤_{P;Z} N if M ≈_{𝒳∖(P∪Z)} N and for all P_i in P, |P_i|_M ⊆ |P_i|_N. The
models of the circumscription of P in a first-order sentence W with variable Z
are then the ≤_{P;Z}-minimal models of W.
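
The model-theoretic characterization can be tried out on a propositional
analogue, in which predicate symbols become atoms and the extension of an atom
is just its truth value. The sketch below rests on that simplifying assumption;
the function name minimal_models and the bird example are hypothetical and not
taken from the text.

```python
from itertools import product

def minimal_models(atoms, w, circumscribed, varied):
    """Models of `w` that are minimal for the ordering which fixes every atom
    outside circumscribed ∪ varied and compares the circumscribed atoms."""
    models = [dict(zip(atoms, vals))
              for vals in product((False, True), repeat=len(atoms))]
    models = [m for m in models if w(m)]
    fixed = [a for a in atoms if a not in circumscribed and a not in varied]

    def leq(m, n):
        # m ≤ n: agreement on the fixed atoms, and every circumscribed atom
        # true in m is also true in n.
        return (all(m[a] == n[a] for a in fixed)
                and all(n[a] or not m[a] for a in circumscribed))

    return [m for m in models
            if not any(leq(n, m) and not leq(m, n) for n in models)]

# W: bird ∧ (bird ⊃ ab ∨ flies); circumscribe ab, let flies vary.
atoms = ("bird", "ab", "flies")
w = lambda m: m["bird"] and ((not m["bird"]) or m["ab"] or m["flies"])
result = minimal_models(atoms, w, {"ab"}, {"flies"})
```

In this toy example the only minimal model makes ab false and flies true, so
flies is a consequence of the circumscription although it is not a consequence
of W alone.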

Let r be the set of function and predicate symbols of W that are not in P ∪ Z.
It can be proved that [W]^{↓r} = [∃p, z (W[p, z])]. Therefore, several algorithms
designed in order to eliminate existentially quantified predicate symbols can
be used in order to perform marginalization. The algorithms of [10,1] always
terminate but succeed only in cases where the formula φ under consideration can
be put in disjunctive normal form such that each conjunct contains no positive
occurrence, or no negative occurrence, of the predicate being eliminated. The
algorithms of [20,2] apply to general formulas but do not always halt. The
elimination of function symbols is the reverse of Skolemization, and is also
performed by these algorithms.

Several works have also highlighted the importance of consequence finding
for answering queries about a circumscriptive theory without having to compute
a first-order equivalent of the theory being circumscribed [14,4,3,6]. It can
be proved that if W is a well-founded theory (that is, if every model of W is
minored by a ≤_{P;Z}-minimal model of W), and if φ is a sentence that contains
no occurrence of predicates in Z, then φ is a consequence of the circumscription
of P in W with variable Z if and only if
[W] ⊆ sup_{≤_{P;Z}}([W ∪ {φ}]) = [W[p, z] ∪ {φ[p], p ≤ P}]^{↓r'}
where r' is the set of predicate and function symbols that
appear in W ∪ {φ}, and φ[p] is the sentence obtained by replacing in φ every
occurrence of a predicate P_i of P by the corresponding p_i of p.



¹ We say that the two sequences of predicate symbols P = (P_1, ..., P_m) and
p = (p_1, ..., p_n) are similar if m = n and P_i and p_i have the same arity for
1 ≤ i ≤ n.
5 Summary and Discussion

We’ve introduced an intermediate framework, that of Similarity Model
Structures, which facilitates the application of Local Computation techniques to
theorem proving and consequence finding. For a given logic, the Local Computation
algorithms produced will be sound and complete as long as the implementations
of combination and marginalization for that logic are sound and complete. The
efficiency of the algorithms depends on the efficiency of these implementations,
and whether a good hypertree cover can be found easily for a particular problem.
As well as illustrating our approach with propositional logic, we’ve shown how
it can be applied to first-order theorem proving and predicate circumscription.
The approach can also be applied to modal and conditional logics.

The approach also applies to probabilistic [23] and possibilistic logics, and 
certain, restrictive, non-monotonic logics which are based on simple conditional 
logics. Apart from theorem proving, a number of other problems can be expressed 
in terms of marginalization to a non-empty set of variables, for example, in power 
structures, correspondence theory, semantics for Hilbert calculi [2], abduction [5]. 

Local Computation methods allow us to break down problems into smaller 
ones to which classical theorem proving techniques can be efficiently applied; for
example, the framework gives strategies for choosing in which order to perform 
resolutions. It is also possible to get information about the complexity of a 
particular calculation, by considering the size of the largest set in the hyperforest; 
in the same way, some formulae which make the computation much worse can 
be recognized as those that increase the size of this largest set. 
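
As a rough illustration of this complexity measure, the sketch below compares
the hyperforest of the propositional example with a single set holding all its
variables. It assumes the partial-model representation, where a combination on
k propositional variables handles at most 2^k partial models; the function name
largest_set_size is hypothetical.

```python
def largest_set_size(hyperforest):
    """Size of the largest set of variables in the hyperforest."""
    return max(len(s) for s in hyperforest)

# Hyperforest used in the propositional example.
hyperforest = [
    {"Q", "R", "S"}, {"P", "Q", "R"}, {"P", "R", "T"},
    {"R", "T", "W"}, {"T", "U", "V", "W"}, {"T", "V", "X"},
]

k = largest_set_size(hyperforest)
local_bound = 2 ** k                   # partial models per local combination
all_variables = set().union(*hyperforest)
global_bound = 2 ** len(all_variables)  # models when combining everything at once
```

For this hyperforest the largest set has 4 variables, so no local combination
involves more than 16 partial models, against 512 complete assignments over all
9 variables.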

Although it is not yet clear how theorem proving algorithms based on Local 
Computation compare with standard ones, the generality of the framework has 
a number of benefits. In particular, there are problems where different kinds of
information are more suitably expressed using different logical representations, 
e.g. sets of models, constraints, clausal forms, terminological descriptions etc. A 
major difficulty of mixing representations is that moving between them tends 
to be computationally expensive; the Local Computation framework suggests 
good places for mixing representations (namely in sub-languages corresponding 
to small intersections between neighbouring sets in the hyperforest). 